Recognition of differential culture conditions with dimensional reduction approach

Microorganisms constantly modify their gene expression and metabolic profiles in response to alterations in their surrounding environment. Monitoring these changes is crucial for regulating microbial production of substances. However, it remains challenging to identify differential culture conditions through the extraction of differentially expressed genes and clustering of gene expression profiles. In this study, we employed a dimensionality reduction technique for yeast gene expression data obtained under multiple culture conditions to visualize discrepancies among culture conditions. Our findings indicate that this approach is effective in identifying multiple culture conditions.


Introduction
Environmental factors such as medium composition [1], oxygen concentration [2], and pH [3] are external to microbial production processes. These factors can affect the growth and productivity of microorganisms, which constantly change their metabolism and gene expression levels in response to such external stimuli. The gene expression levels of microorganisms can therefore serve as indicators of changes in the environment and in the state of the microorganisms themselves. In fact, some microbial genes are stress-responsive genes that respond to changes in external conditions [4]. By analyzing genes with specific profiles under certain conditions, it is possible to infer the environment in which the microorganism is located.
Several studies have used genes with variable expression to discriminate among the surrounding environments and culture conditions of microorganisms [5] [6]. Here, we propose an approach that combines machine learning and dimensionality reduction techniques to visually distinguish between conditions from gene expression data. Our method could be useful for the visual recognition of yeast gene expression data across multiple culture conditions.

Gene expression data in various culture conditions
As gene expression data under various culture conditions, we used yeast gene expression patterns reported in a previous study [7]. The data were obtained with DNA microarrays, which captured changes in the transcriptome in response to various environmental stresses. Each gene recorded in these data was annotated using the Saccharomyces Genome Database (SGD) [8]. To facilitate comparison between samples, the gene expression levels were normalized to a mean of 0 and a variance of 1 within each sample.
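The per-sample normalization described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `standardize_per_sample` and the layout (genes in rows, samples in columns) are assumptions made for the example.

```python
import numpy as np

def standardize_per_sample(expr):
    """Scale each sample (column) to mean 0 and variance 1.

    expr: array of shape (n_genes, n_samples); this layout is an assumption.
    """
    expr = np.asarray(expr, dtype=float)
    mean = expr.mean(axis=0, keepdims=True)
    std = expr.std(axis=0, keepdims=True)
    # Guard against constant columns to avoid division by zero.
    std[std == 0] = 1.0
    return (expr - mean) / std
```

After this step every sample has the same location and scale, so differences between samples reflect the shape of the expression profile rather than overall signal intensity.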

Extraction of condition-specific expressed genes
To extract genes with different expression patterns across conditions, we searched for condition-specific genes using Shannon's entropy [9]. Entropy H is defined as follows:

$$H_v = -\sum_{m=1}^{N} p_{v,m} \log_2 p_{v,m}$$

where v is the v-th gene, m is the m-th condition, N is the number of conditions, and $p_{v,m}$ is the relative expression level of the v-th gene under the m-th condition. For each gene expression pattern, entropy H was calculated, and genes with lower entropy were defined as condition-specific genes. In this study, 30 genes with low entropy were selected as candidate genes that showed different expression patterns under different conditions.
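The entropy-based selection described above can be sketched as follows. The sketch assumes non-negative expression values that are converted to relative expression by dividing by the row sum; how the standardized values were mapped to probabilities is not stated here, so that step, the base-2 logarithm, and the function names are all assumptions.

```python
import numpy as np

def shannon_entropy_per_gene(expr, eps=1e-12):
    """Shannon entropy (in bits) of each gene across conditions.

    expr: non-negative array of shape (n_genes, n_conditions).
    """
    expr = np.asarray(expr, dtype=float)
    if np.any(expr < 0):
        raise ValueError("expression values must be non-negative")
    totals = expr.sum(axis=1, keepdims=True)
    # Relative expression p[v, m] of gene v in condition m (rows sum to 1).
    p = expr / np.maximum(totals, eps)
    # Treat 0 * log2(0) as 0 by only taking the log where p > 0.
    logp = np.zeros_like(p)
    np.log2(p, out=logp, where=p > 0)
    h = -(p * logp).sum(axis=1)
    # Genes with no signal at all carry no information; never select them.
    h[totals[:, 0] <= 0] = np.inf
    return h

def select_low_entropy_genes(expr, n_select=30):
    """Indices of the n_select genes with the lowest entropy."""
    h = shannon_entropy_per_gene(expr)
    return np.argsort(h)[:n_select]
```

A gene expressed uniformly across N conditions reaches the maximum entropy of log2(N), whereas a gene expressed in a single condition has entropy 0, which is why low entropy is read as condition specificity.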

Dimensional reduction process by LDA
Linear discriminant analysis (LDA) [10] was used as the dimensionality reduction approach in this study. This approach projects the original data matrix into a lower-dimensional space and consists of three major steps. First, the distance between the means of the different classes, called the between-class variance, is computed. Then, the distance between the mean and the samples of each class, called the within-class variance, is computed. Finally, a lower-dimensional space is constructed that maximizes the between-class variance and minimizes the within-class variance. In other words, the features are transformed into a low-dimensional space that maximizes the ratio of between-class variance to within-class variance.
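When the number of classes is K, a linear map that takes d-dimensional (d > K) data x to a (K-1)-dimensional space is represented as follows:

$$y_k = w_k^{T} x, \quad k = 1, \dots, K-1$$

where each $w_k$ is a vector. With $y = (y_1, \dots, y_{K-1})^{T}$ and $W = (w_1, \dots, w_{K-1})$, the K-1 linear transformations are described together as follows:

$$y = W^{T} x$$

The within-class variation $\tilde{S}_W$, between-class variation $\tilde{S}_B$, and total variation $\tilde{S}_T$ after the linear transformation are defined as:

$$\tilde{S}_W = W^{T} S_W W, \quad S_W = \sum_{k=1}^{K} \sum_{n \in C_k} (x_n - m_k)(x_n - m_k)^{T}$$

$$\tilde{S}_B = W^{T} S_B W, \quad S_B = \sum_{k=1}^{K} N_k (m_k - m)(m_k - m)^{T}$$

$$\tilde{S}_T = W^{T} S_T W, \quad S_T = S_W + S_B$$

where $m_k$ is the mean vector of class k, m is the overall mean vector, $C_k$ is the set of samples in class k, and $N_k$ is the number of samples in class k. Finally, the ratio of the between-class variance to the within-class variance is maximized and the mapping matrix W is obtained.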
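The LDA projection described above can be sketched with NumPy as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the function name `lda_projection` is hypothetical, and a pseudo-inverse is used so that the sketch still runs when the within-class scatter matrix is singular (for example, when there are more genes than samples).

```python
import numpy as np

def lda_projection(X, labels):
    """Project X (n_samples, d) onto at most K-1 discriminant axes."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    d = X.shape[1]
    m = X.mean(axis=0)                      # overall mean vector
    S_w = np.zeros((d, d))                  # within-class scatter
    S_b = np.zeros((d, d))                  # between-class scatter
    for k in classes:
        X_k = X[labels == k]
        m_k = X_k.mean(axis=0)              # mean vector of class k
        centered = X_k - m_k
        S_w += centered.T @ centered
        diff = (m_k - m)[:, None]
        S_b += X_k.shape[0] * (diff @ diff.T)
    # Leading eigenvectors of pinv(S_w) @ S_b maximize the ratio of
    # between-class to within-class scatter.
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_w) @ S_b)
    order = np.argsort(eigvals.real)[::-1][: len(classes) - 1]
    W = eigvecs[:, order].real              # mapping matrix, shape (d, K-1)
    return X @ W, W
```

Because the between-class scatter has rank at most K-1, only K-1 eigenvalues can be non-zero, which is why the projection has K-1 dimensions.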

Mapping to two-dimensional space
To map each sample onto a two-dimensional space, we performed principal component analysis (PCA) as a linear mapping approach. With eigenvalues λ and eigenvectors h, we solve the following eigenequation:

$$(S' - \lambda I) h = 0$$

where S' is the variance-covariance matrix of Z' and I is the identity matrix. The obtained eigenvalues λ and eigenvectors h are as follows:

$$\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p, \quad h_1, h_2, \dots, h_p$$

where p is the number of variables in Z'. Therefore, the principal components are represented as follows:

$$t_j = Z' h_j, \quad j = 1, \dots, p$$

We also used t-distributed Stochastic Neighbor Embedding (t-SNE) [11], a nonlinear dimensionality reduction method, to visualize each condition. The t-SNE method is designed to project high-dimensional data into a low-dimensional space with minimal loss of structural information, so that points that are close to each other in the low-dimensional space are in a similar state in the high-dimensional space.
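The PCA step described above, solving the eigenequation of the variance-covariance matrix and projecting onto the leading eigenvectors, can be sketched as follows. The function name `pca_2d` and the explicit mean-centering are assumptions of this example.

```python
import numpy as np

def pca_2d(Z):
    """Return 2-D PCA scores plus all eigenvalues and eigenvectors.

    Z: array of shape (n_samples, n_features) with n_features >= 2.
    """
    Z = np.asarray(Z, dtype=float)
    Zc = Z - Z.mean(axis=0)                  # center each feature
    S = np.cov(Zc, rowvar=False)             # variance-covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)     # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]        # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Zc @ eigvecs[:, :2]             # first two principal components
    return scores, eigvals, eigvecs
```

The variance of each score column equals the corresponding eigenvalue, so the ratio of an eigenvalue to the sum of all eigenvalues gives the fraction of variance explained by that component.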
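In this approach, the high-dimensional Euclidean distances between data points are converted into joint probability distributions. Let $p_{ij}$ be the joint probability between data points $x_i$ and $x_j$ in the original space, $q_{ij}$ the joint probability after compression, and $z_i$ the data points after compression. For n data points, $p_{ij}$ is represented as follows:

$$p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \ne i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}, \quad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$$

where $\sigma_i$ is the bandwidth of the Gaussian kernel centered on $x_i$. To mitigate the crowding problem, t-SNE uses a Student's t-distribution with one degree of freedom for the probability distribution after compression. The joint probability distribution $q_{ij}$ after compression is expressed as follows:

$$q_{ij} = \frac{(1 + \|z_i - z_j\|^2)^{-1}}{\sum_{k \ne l} (1 + \|z_k - z_l\|^2)^{-1}}$$

t-SNE minimizes the Kullback-Leibler (KL) divergence between the joint probability distributions $p_{ij}$ and $q_{ij}$, given by the following cost function C:

$$C = \sum_{i} \sum_{j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$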
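The quantities used by t-SNE can be evaluated directly, as in the sketch below. It only computes the probabilities and the cost for a given embedding; it does not perform the gradient-descent optimization. As a simplification, one shared bandwidth `sigma` is assumed for all points, whereas t-SNE normally tunes a bandwidth per point from a perplexity setting. All function names are hypothetical.

```python
import numpy as np

def pairwise_sq_dists(A):
    """Squared Euclidean distances between all rows of A."""
    A = np.asarray(A, dtype=float)
    sq = np.sum(A * A, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * (A @ A.T)
    return np.maximum(D, 0.0)                # clip tiny negative round-off

def joint_p(X, sigma=1.0):
    """Symmetrized Gaussian affinities p_ij (shared sigma: a simplification)."""
    D = pairwise_sq_dists(X)
    n = D.shape[0]
    logits = -D / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)        # a point is not its own neighbor
    logits -= logits.max(axis=1, keepdims=True)
    cond = np.exp(logits)
    cond /= cond.sum(axis=1, keepdims=True)  # row i holds p_{j|i}
    return (cond + cond.T) / (2.0 * n)

def joint_q(Z):
    """Student-t (one degree of freedom) affinities q_ij of an embedding Z."""
    kernel = 1.0 / (1.0 + pairwise_sq_dists(Z))
    np.fill_diagonal(kernel, 0.0)
    return kernel / kernel.sum()

def kl_cost(P, Q, eps=1e-12):
    """KL divergence between P and Q, skipping entries where P is zero."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps))))
```

The heavy tail of the Student's t kernel lets moderately distant points sit far apart in the embedding without a large penalty, which is how the crowding problem is reduced.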

Hierarchical clustering by gene expression levels
Hierarchical clustering was performed using the expression data of the 30 genes that showed low entropy. The selected genes were divided into two major clusters, as shown in Figure 1. Samples from the same condition were not necessarily clustered close together, except for those from the stationary phase (stationaryPhase) of culture. In the gene expression data during heat shock, the samples were clustered according to the temperature conditions. For example, samples shifted to a low temperature (from 37°C to 25°C) clustered together, as did samples shifted to a high temperature (from 25°C to 37°C).
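A hierarchical clustering of this kind can be sketched with SciPy as follows, using Ward's method as named in the Figure 1 caption. The function name `ward_clusters`, the choice of Euclidean distance, and the cut into a fixed number of clusters are assumptions of this example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def ward_clusters(data, n_clusters=2):
    """Cluster the rows of data with Ward's method.

    data: array of shape (n_items, n_features), e.g. samples by selected genes.
    Returns the linkage matrix and a flat cluster label for each row.
    """
    data = np.asarray(data, dtype=float)
    tree = linkage(data, method="ward")      # Ward linkage, Euclidean distance
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    return tree, labels
```

Passing the transposed matrix clusters the genes instead of the samples, and the returned linkage matrix can be given to `scipy.cluster.hierarchy.dendrogram` to draw the tree.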

Mapping to two-dimensional space
We attempted to map the gene expression data to a two-dimensional space to distinguish differences among conditions. In this study, visualization in a two-dimensional space was performed using principal component analysis (PCA), one of the most common methods in multivariate data visualization, and t-distributed Stochastic Neighbor Embedding (t-SNE), a nonlinear dimensionality reduction method.
In the visualization results for all genes, the trends differed between PCA and t-SNE. In PCA, except for the stationary phase and diamide treatment, the samples did not seem to cluster well by condition on the two components, whereas in t-SNE, conditions other than the stationary phase and diamide treatment also clustered well along the horizontal and vertical axes. However, in the hyper-osmotic and steady-state conditions, the samples were scattered.
After dimensionality reduction by LDA, the samples in the PCA plot clustered more clearly by condition than when all genes or the selected genes were used. However, while some conditions were well separated, other clusters were difficult to distinguish. In addition, the samples exposed to menadione bisulfite mapped farther apart from the other conditions. With the combination of LDA and t-SNE, the various conditions were most clearly distinguishable. Except for the diamide treatment and mild heat shock, samples from the same condition were mapped in proximity.

Discussion
PCA and t-SNE using all genes made it difficult to distinguish between the different conditions. When all genes were used, the overall variation in gene expression might have been captured rather than condition-specific variation. In t-SNE, we identified some conditions that were well organized along the horizontal and vertical axes, but several samples from identical conditions were widely separated in the two-dimensional space. As a result, without variable selection or dimensionality reduction, the different conditions could not be successfully recognized in a two-dimensional space.
We selected 30 condition-specific genes using an entropy-based gene selection approach. In hierarchical clustering based on these genes, samples from the same condition were not necessarily clustered close together, except for the stationary phase, but in the gene expression data during heat shock, the clusters were separated by temperature condition. PCA using the 30 selected genes did not cluster the samples well by condition, although the conditions appeared to be aligned along the first principal component. In contrast, t-SNE had difficulty recognizing the hyper-osmotic and steady-state conditions, although the conditions were easier to distinguish than with all genes. This suggests that some of the selected genes showed a characteristic pattern under particular conditions but could not discriminate among all conditions.
The approach proposed in this study, dimensionality reduction with LDA, facilitated the recognition of each condition better than the use of all genes or the selected genes. We were able to recognize each condition most clearly when LDA was combined with t-SNE. These results suggest that our method is an effective approach for two-dimensional interpretation of several different conditions from gene expression data.

Figure 1. Heat map and dendrogram of the gene expression data. This figure shows the results of clustering using Ward's method based on 30 genes with low entropy. The columns represent the gene names and the rows specify the respective culture conditions.

Figure 3. PCA and t-SNE using the selected genes.