Toward Model Compression for a Deep Learning–Based Solar Flare Forecast on Satellites

Timely solar flare forecasting is hampered by the delay of transmitting vast amounts of observation data from satellites to the ground. To avoid this delay, forecasting models are expected to be deployed on satellites, so that only the forecasting results, rather than huge volumes of observation data, need to be transmitted, greatly saving network bandwidth and reducing forecasting latency. However, deep-learning models have huge numbers of parameters and therefore require large memory and strong computing power, which hinders their deployment on satellites with limited memory and computing resources. Compressing forecasting models is thus essential for efficient on-board deployment. First, three typical compression methods, namely knowledge distillation, pruning, and quantization, are examined individually for compressing solar flare forecasting models. Then, an assembled compression framework is proposed to compress such models further. The experimental results demonstrate that the assembled framework can compress a pretrained solar flare forecasting model to only 1.67% of its original size while maintaining forecasting accuracy.


Introduction
Solar activities, such as solar flares, solar energetic particle events, and coronal mass ejections, have a strong influence on the space environment. Among these, solar flares are the most violent electromagnetic radiation events across the entire spectrum and have the potential to threaten satellites, radio communications, and power grids. Consequently, solar flare forecasting has received much attention. A solar flare forecasting method extracts current and historical information from observed solar data to build a predictor of future solar activity. Such data can be obtained from several well-known solar satellites, such as the Solar Dynamics Observatory (SDO), the Solar and Heliospheric Observatory (SOHO), and the Solar Terrestrial Relations Observatory (STEREO). However, existing methods require vast amounts of observational data to be transmitted from solar satellites to Earth before they can be analyzed by the forecasting models, compromising the timeliness of solar flare forecasts.
This work explores the implementation of solar flare forecasting on satellites, where forecasting is conducted on board and only the forecasting results are transmitted to the ground, avoiding mass data transmission. To deploy a forecasting model on satellites, two requirements must be considered. First, the model must maintain high forecasting accuracy and robustness in various scenarios once deployed. Second, the model must be lightweight, as the computing capability on board a satellite is limited.
In the past, forecasting models were developed using statistical methods and machine learning. In recent years, deep learning has been extensively explored for solar flare forecasting, inspired by its great success in computer vision, natural language processing, etc. Machine learning-based solar flare forecasting (Huang et al. 2012; Huang & Wang 2013; Huang et al. 2013; Florios et al. 2018; Lavasa et al. 2021; Ribeiro & Gradvohl 2021) mainly used support vector machines (SVMs), multilayer perceptrons (MLPs), and random forests (RFs) to establish forecasting models. The polarity inversion line (PIL) mask, which encloses the PIL areas, has significant potential for improving machine learning-based flare-forecasting models (Wang et al. 2020). In that work, the kernel principal component analysis (KPCA) algorithm was adopted to extract features from the PIL mask and the difference PIL mask, and active regions (ARs) were classified with these features.
For the first time, Huang et al. (2018) applied a convolutional neural network (CNN), a deep-learning model, to flare forecasting. They directly fed magnetograms of solar ARs to the model without extracting physical parameters, achieving a large improvement in forecasting accuracy. CNNs have a great ability for image feature extraction (Bhattacharjee et al. 2020). In Abed et al. (2021), spatial features of sunspot-group images were extracted using a CNN as the input of a Softmax classifier, whose output was whether a flare occurred or not. In Deshmukh et al. (2022), VGG-16, a CNN architecture, was introduced as the feature extractor of magnetograms to address the problem of false positives in solar flare forecasting. Solar flare forecasting usually concerns only the occurrence of flares; however, flares have multiple classes (C, M, X) in terms of their intensities, and Zheng et al. (2019) used a CNN to make multiclass forecasts of solar flares. In addition, to address the problem of unbalanced flare samples, Deng et al. (2021) used a generative adversarial network (GAN) to augment the sample data and proposed a hybrid CNN model for flare forecasting. Nishizuka et al. (2018) proposed a flare-forecasting model employing a deep neural network (DNN); this model calculated the probability of a flare occurrence and fed the probability to a binary classifier to determine the class of the flare. Liu et al. (2019) proposed a forecasting model based on the long short-term memory (LSTM) network to capture temporal information in the data samples; the LSTM has been used widely in other works (Yi et al. 2020; Ahmadzadeh et al. 2021; Sun et al. 2022). In recent works (Tang et al. 2021; Chen et al. 2022), mixed models combining boosting and a CNN were used to forecast flares and classify sunspot groups.
When a forecasting model is deployed on a satellite, it can implement the forecasting task directly, without transmitting a huge volume of observation data from the satellite to the ground, which improves the real-time capability of forecasting. The forecasting result is transmitted directly to the ground, saving network bandwidth significantly. However, present deep learning-based flare-forecasting models do not consider the limitations of computing and storage resources on satellites and are thus not suitable for deployment. Therefore, model compression is required before a deep-learning model is deployed on a satellite. In the model compression literature, knowledge distillation (KD; Hinton et al. 2015), pruning (Hu et al. 2018), and quantization (Wu et al. 2016) have been widely investigated.
In this paper, we investigate these compression methods separately on our solar flare forecasting model of interest. In addition, we propose an assembled compression framework integrating multiple compression methods to compress the flare-forecasting model. Our contributions are threefold: 1. We are the first to investigate the compression of deep-learning models in the field of solar flare forecasting. 2. We verify the effectiveness of knowledge distillation, pruning, and quantization for compressing solar flare forecasting models. 3. An assembled compression framework for compressing solar flare forecasting models is proposed and evaluated on a large-scale solar flare database.
The rest of this paper is organized as follows: Section 2 describes the database. Section 3 introduces the three compression methods. The implementation and evaluation of the proposed compression framework are presented in Section 4. Conclusions and future work are provided in Section 5.

Data
For model training, a large-scale data set for solar flare forecasting should be provided. In our previous work (Huang et al. 2012), we have already established such a large-scale data set. The data set consists of line-of-sight magnetograms of ARs from both SOHO/Michelson Doppler Imager (MDI) and SDO/Helioseismic and Magnetic Imager (HMI), ranging from 1996 January 1 to 2015 October 1. SOHO/MDI and SDO/HMI provide continuous and high-quality photospheric magnetic field observations. SOHO/MDI began its routine observations on 1996 January 1 and terminated on 2011 April 12. SDO/HMI, the successor of SOHO/MDI, began its routine observations on 2010 April 30. The line-of-sight magnetograms of ARs can be obtained from the tracked AR patch data products in the Joint Science Operations Center (JSOC) database. To avoid the projection effects of magnetic field observations near the solar limb, only the ARs whose central longitudes are located within ±30° of the central meridian are included in the data set. For consistency of spatiotemporal resolution, SDO/HMI was downsampled to the same resolution as SOHO/MDI, with a cadence of 96 minutes and a spatial resolution of 2″.
In this work, only SDO/HMI data with a cadence of 96 minutes and a spatial resolution of 0.5″ are used. In addition, the data set is extended from 2015 October 1 to 2019 January 26. Thus, the data set contains 70822 nonflaring samples (negative samples) and 2988 flaring samples (positive samples) in total. The data distribution in chronological order is shown in Table 1. The 24th solar cycle began in 2008 and ended in 2019. Its maximum was in 2014 (with two peaks in 2012.2 and 2014.4), while 2008 and 2019 were two minimum years. The Sun is much more active in maximum years than in minimum years; thus, there were more ARs and solar flares around 2014 than in other years. It should be noted that both flares and nonflares come from ARs. Therefore, Table 1 contains more of both positive and negative samples in the years around 2014 than in other years. Note that a sample here is an AR, not a full-disk image.
Solar flares are classified into A, B, C, M, and X types according to the peak magnitude of the soft X-ray flux observed by the Geostationary Operational Environmental Satellite system. Usually, M-class flares and above are taken as positive samples, while the others are negative samples for solar flare forecasting. The solar flare data are downloaded from https://www.ngdc.noaa.gov/stp/space-weather/solar-data/solar-features/solar-flares/x-rays/goes/xrs/. A magnetogram of an AR is labeled as a flare (positive sample) if at least one flare occurs in this AR within a given period of time (48 hr in this work). Otherwise, it is considered to be a nonflare (negative sample).

Method
We denote an m-channel input of the model by $x \in \mathbb{R}^{a \times b \times m}$, where each channel of x is a matrix of size a × b. An L-layer feed-forward CNN can then be expressed as

$$y = f_L(\sigma(f_{L-1}(\cdots \sigma(f_1(x)) \cdots))),$$

KD (Hinton et al. 2015), pruning (Hu et al. 2018), and quantization (Wu et al. 2016) are the main methods of model compression. They aim to reduce the volume and inference time of a model. For the first time, we apply these three methods to a deep-learning model for solar flare forecasting, achieving near-lossless compression. In addition, an assembled compression framework is proposed by integrating the three compression methods. In this framework, KD is first applied to a pretrained model to obtain a compressed one. Then, the compressed model is pruned. After that, the output of the pruning is further quantized. The details of the three compression methods are presented in the remainder of this section.

Knowledge Distillation
KD (Hinton et al. 2015) is a popular model compression method in which a small model is trained to mimic a pretrained, larger model. It realizes a process of transferring knowledge from a large model to a smaller one. Figure 1 illustrates the knowledge distillation framework for training a student model, a simpler flare-forecasting model composed of four convolutional layers and one fully connected layer. In contrast, the teacher model is a pretrained flare-forecasting model built on the ResNet18 network, which consists of 17 convolutional layers and one fully connected layer. Under this framework, the student model is trained with the Adam optimizer on the data set described in Section 2.
The distillation loss is the cross entropy between the teacher's and the student's classification probabilities, called the soft loss $\mathcal{L}_{\rm soft}$:

$$\mathcal{L}_{\rm soft} = -\sum_{i=1}^{N} p_i^T \log q_i^T, \quad p_i^T = \frac{\exp(v_i/T)}{\sum_{j}\exp(v_j/T)}, \quad q_i^T = \frac{\exp(z_i/T)}{\sum_{j}\exp(z_j/T)}, \qquad (2)$$
where $v_i$ and $z_i$ are the logits of the teacher and student models, respectively, and $p_i^T$ and $q_i^T$ are the softmax outputs of the teacher and student models on class i at temperature T, respectively. From Equation (2), the temperature T changes the output probability distribution: a larger T suppresses the probability of the positive class and increases the probabilities of the negative classes, resulting in a softer probability distribution over all classes for each sample. This is equivalent to allocating more attention to negative classes, so that the model better learns the correlations between different classes during training. Thus, the teacher model can teach the student model more, since the negative tags also contain information helpful for training.
The classification loss, namely the hard loss $\mathcal{L}_{\rm hard}$, is the cross entropy between the output of the student model and the real label:

$$\mathcal{L}_{\rm hard} = -\sum_{i=1}^{N} c_i \log q_i,$$
where $c_i$ is the real tag on class i, $q_i$ is the softmax output of the student model on class i, and N is the number of classes. The total loss is a weighted combination of the soft and hard losses:

$$\mathcal{L} = \lambda \mathcal{L}_{\rm soft} + (1-\lambda)\,\mathcal{L}_{\rm hard},$$

where λ balances the two terms.
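As a concrete illustration, the distillation objective above can be sketched in a few lines of NumPy (a minimal sketch, not the training code used in this work; the weighting coefficient `lam` is an assumed hyperparameter):

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax: a larger T yields a softer distribution.
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kd_loss(teacher_logits, student_logits, label, T=4.0, lam=0.7):
    # Soft loss: cross entropy between the temperature-softened teacher
    # and student distributions (Equation (2)).
    p = softmax(teacher_logits, T)    # teacher p_i^T
    q = softmax(student_logits, T)    # student q_i^T
    soft = -np.sum(p * np.log(q))
    # Hard loss: cross entropy between the student output and the true label.
    q1 = softmax(student_logits, 1.0)
    hard = -np.log(q1[label])
    # Weighted combination of the two losses.
    return lam * soft + (1.0 - lam) * hard
```

Raising T visibly flattens the teacher's distribution, which is the mechanism by which the negative classes receive more attention during training.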

Pruning
A pruning framework based on pruning convolutional kernels is shown in Figure 2. The core of pruning is to define a criterion that filters out unimportant parameters. For the pretrained flare-forecasting model described in Liu et al. (2022), we implement the following strategy: for the ith convolutional layer, which contains $n_i$ kernels, we compute the ℓ1 norm $\|W_j\|_1$ of each kernel $W_j$ and prune the m ($m < n_i$) kernels with the smallest norms.
For CNNs, pruning is achieved by discarding some convolutional channels, i.e., whole convolutional kernels, according to their importance. Thus, the number of convolutional kernels/output channels is reduced, and the model size is greatly compressed after pruning. However, the model performance is seriously compromised, so the pruned model usually needs to be retrained. In our case, the flare-forecasting model (Liu et al. 2022) loses forecasting accuracy dramatically after pruning; it is therefore retrained on the data set described in Section 2 to recover its performance. The pruning threshold is a hyperparameter that determines the ratio of pruned parameters; it is empirically set to 0.5 in our experiments as a tradeoff between model performance and model size. Table 2 lists the number of kernels of the pretrained solar flare forecasting model (Liu et al. 2022) before and after pruning.
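The kernel-level criterion can be sketched as follows (an illustrative NumPy sketch of ℓ1-norm channel pruning, not the exact implementation used here; the retraining step that the text requires afterwards is omitted):

```python
import numpy as np

def prune_layer(kernels, ratio=0.5):
    """Keep the convolutional kernels with the largest l1 norms.

    kernels: array of shape (n_out, n_in, k, k); ratio: fraction of
    kernels to prune. Returns the retained kernels and their indices,
    so that the next layer's input channels can be pruned consistently.
    """
    n_out = kernels.shape[0]
    # ||W_j||_1 for each output kernel j
    norms = np.abs(kernels).reshape(n_out, -1).sum(axis=1)
    n_keep = n_out - int(ratio * n_out)
    # indices of the kernels with the largest norms, in original order
    keep = np.sort(np.argsort(norms)[::-1][:n_keep])
    return kernels[keep], keep
```

With ratio = 0.5, half the output channels of each layer are dropped, which matches the empirical threshold mentioned above.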

Quantization
Figure 3 illustrates the pipeline of quantization. The weights of the pretrained flare-forecasting model are 32-bit floating-point numbers; quantization converts them to low-bit fixed-point types. Quantization only changes the data types of the weights, while the architecture of the forecasting model remains unchanged. In this work, we employ a clipped uniform quantizer to compress the weights. Since the weights of the same layer usually vary within a small range, we first clip the weights to given maximum and minimum thresholds. Then, the uniform quantizer with clipping can be expressed as

$$Q(x) = \mathrm{round}\!\left(\frac{\mathrm{clip}(x, -\alpha, \alpha) + \alpha}{2\alpha}\,(2^b - 1)\right),$$
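A minimal sketch of such a clipped uniform quantizer follows (an illustration under the assumption that weights are clipped symmetrically at ±α and mapped to the integer range [0, 2^b − 1]; the actual on-board implementation may differ in its clipping thresholds):

```python
import numpy as np

def quantize_weights(w, b=6, alpha=None):
    """Clipped uniform quantization of weights to b-bit integer codes.

    alpha is the clipping threshold; weights are clipped to [-alpha, alpha]
    and mapped uniformly to the integer range [0, 2**b - 1].
    """
    w = np.asarray(w, dtype=float)
    if alpha is None:
        alpha = np.abs(w).max()       # default: clip at the largest weight
    levels = 2 ** b - 1
    clipped = np.clip(w, -alpha, alpha)
    q = np.round((clipped + alpha) / (2 * alpha) * levels)
    return q.astype(int)

def dequantize(q, b, alpha):
    # Map integer codes back to approximate floating-point weights.
    levels = 2 ** b - 1
    return q / levels * 2 * alpha - alpha
```

The round-trip error of each weight is bounded by half a quantization step, which is why 6-bit quantization barely affects the forecast while 4-bit quantization does.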

Assembled Compression Framework
To achieve a lower compression ratio, the three techniques, knowledge distillation, pruning, and quantization, can be applied sequentially to form an efficient compression framework. Knowledge distillation simplifies the network structure and reduces the number of parameters in a new model. Pruning removes parameters that do not meet a specific criterion, while quantization represents the model parameters with fewer bits, such as converting 32-bit floating-point values to 6-bit fixed-point values. In the compression framework, knowledge distillation is performed first, followed by pruning and then quantization.
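The prescribed ordering can be expressed as a simple pipeline sketch; the stage functions here are hypothetical placeholders for the corresponding training routines, and only the sequencing (distill, then prune and retrain, then quantize) reflects the framework:

```python
def assembled_compress(model, train_data, distill, prune, retrain, quantize,
                       prune_ratio=0.5, bits=6):
    """Sequential compression pipeline: distillation -> pruning -> quantization.

    Each argument after train_data is a callable standing in for the
    corresponding compression step; only the ordering is prescribed.
    """
    student = distill(model, train_data)   # smaller student architecture
    pruned = prune(student, prune_ratio)   # drop low-importance channels
    pruned = retrain(pruned, train_data)   # recover accuracy lost to pruning
    return quantize(pruned, bits)          # b-bit fixed-point weights
```

Because quantization requires no further training (training would revert the weights to floating point), it naturally sits last in the chain.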

Experiment
In order to evaluate the three compression methods for solar flare forecasting models, we applied them to a ResNet18 model previously trained on a large-scale solar flare data set (Liu et al. 2022). We then assessed the compression ratio and forecasting accuracy of the model. Figure 4 displays the flow chart of knowledge distillation, pruning, quantization, and their combination. We observed that the model was significantly compressed, with a compression ratio of 1.67% achieved through three rounds of compression (knowledge distillation, pruning, and quantization). That is, the model was reduced to only 1.67% of its original size while maintaining nearly the same level of performance.

Evaluation Metrics
Solar flare forecasting is a binary classification problem. To evaluate a binary classifier, the contingency table is defined in Table 3. If an observed positive sample is correctly classified, "True Positive" (TP) increases by 1; otherwise, "False Negative" (FN) increases by 1. If an observed negative sample is correctly classified, "True Negative" (TN) increases by 1; otherwise, "False Positive" (FP) increases by 1. In our case, an AR with at least one flare is regarded as a positive sample, while the others are negative samples.
To evaluate a solar flare forecasting model, several statistical parameters are computed from TP, FP, TN, and FN. First, the proportion of correctly classified samples evaluates the accuracy of the model:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$

However, in the case of extremely unbalanced samples, accuracy is not an objective evaluation index. Therefore, the true positive rate (TPR), true negative rate (TNR), false negative rate (FNR), and false positive rate (FPR) are used:

$$\mathrm{TPR} = \frac{TP}{TP + FN}, \quad \mathrm{TNR} = \frac{TN}{TN + FP}, \quad \mathrm{FNR} = \frac{FN}{TP + FN}, \quad \mathrm{FPR} = \frac{FP}{FP + TN}.$$

The TPR indicates what percentage of the positive samples are correctly classified. Different classification thresholds generate different TPR and FPR values. Taking FPR and TPR as the x and y coordinates, respectively, each threshold yields a coordinate point, and connecting these points gives the receiver operating characteristic (ROC) curve. In addition, the area under the ROC curve (AUC) can be used as an indicator of model quality: the larger the AUC, the better the forecasting performance.
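The rates above and the threshold sweep behind the ROC curve can be sketched directly (a NumPy illustration of the definitions, not the evaluation code used in this work):

```python
import numpy as np

def rates(tp, fn, tn, fp):
    # Contingency-table rates for binary flare forecasting.
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "TPR": tp / (tp + fn),
        "TNR": tn / (tn + fp),
        "FNR": fn / (tp + fn),
        "FPR": fp / (fp + tn),
    }

def roc_auc(scores, labels):
    """AUC via the trapezoidal rule over (FPR, TPR) points obtained by
    sweeping the classification threshold over every predicted score."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    thresholds = np.concatenate(([np.inf], np.sort(scores)[::-1]))
    tpr, fpr = [], []
    for t in thresholds:
        pred = scores >= t
        tp = np.sum(pred & (labels == 1)); fn = np.sum(~pred & (labels == 1))
        fp = np.sum(pred & (labels == 0)); tn = np.sum(~pred & (labels == 0))
        tpr.append(tp / (tp + fn)); fpr.append(fp / (fp + tn))
    # trapezoidal integration of TPR over FPR
    auc = 0.0
    for i in range(1, len(fpr)):
        auc += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
    return float(auc)
```

A classifier whose scores perfectly separate the classes attains an AUC of 1.0, while random scoring yields about 0.5.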

Experimental Results
In the first set of experiments, the three compression methods are applied individually to a pretrained ResNet18 (Liu et al. 2022). The AUC of the original pretrained ResNet18 on the test set is 0.91. Taking this pretrained ResNet18 as the teacher model, a student model with a simpler network structure is obtained after knowledge distillation. The AUC of the student on the test set is 0.94, which is even better than the teacher's. The reason for this gain may be that knowledge distillation not only transfers knowledge from a big model (teacher) to a small one (student) but also produces more compressed, low-dimensional features, which are generally recognized to benefit a model's generalization in machine learning. The performance of the ResNet18 after pruning drops seriously, with an AUC of 0.68, since pruning removes many parameters that may extract fine features of the magnetogram; fine-tuning is therefore needed to restore its performance. In Figure 5, the ROC curves with and without fine-tuning after pruning are compared, and a remarkable improvement of the AUC can be observed for the fine-tuned model, restored from 0.68 to 0.90. Both knowledge distillation and pruning change the structure of the model, while quantization does not. The ROC curves of the quantized ResNet18 models are compared in Figure 6: 8-bit quantization yields an AUC of 0.92, 7-bit of 0.91, 6-bit of 0.91, 5-bit of 0.88, and 4-bit of 0.68. Thus, there is almost no drop of performance when the ResNet18 is quantized to 6 bits, while there is a serious drop when it is quantized to 4 bits.
In the second set of experiments, we evaluate the effectiveness of the proposed assembled compression framework, i.e., the three compression methods are implemented on the pretrained ResNet18 successively. First, knowledge distillation is applied to obtain a student model with a simple structure. Then, the student model is pruned to eliminate redundant channels in the filter groups. Finally, the pruned model is quantized to 6 bits to obtain a compact model. The AUC of the final compressed model on the test set reaches 0.94. The ROC curve of the final compressed model is shown in Figure 7. The numbers of FLOPs and parameters decrease to 0.289 × 10^9 and 0.944 × 10^6, respectively, about 15% and 8% of the original ones. The size of the compressed model decreases from 42.70 MB to 0.71 MB, about 1.67% of the original size. For comparison, the compression ratio, the number of parameters, the model size, and the FLOPs are listed in Table 4 for each individual compression method and their assembly. Since knowledge distillation and pruning simplify the structure of the model, the numbers of FLOPs and parameters decrease significantly after these two steps. The pretrained ResNet18 has 1.819 × 10^9 FLOPs and 11.177 × 10^6 parameters. Knowledge distillation alone compresses them to 0.556 × 10^9 and 1.898 × 10^6, respectively; pruning alone compresses them to 1.290 × 10^9 and 7.791 × 10^6, respectively. Since quantization does not revise the model structure, the numbers of FLOPs and parameters remain the same.
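These size figures can be roughly cross-checked from the parameter counts and bit widths alone (a back-of-the-envelope sketch; the small residual gap relative to the reported 42.70 MB and 0.71 MB presumably comes from model metadata and non-weight tensors):

```python
def model_size_mb(n_params, bits_per_weight):
    # Approximate weight-storage size in MB (1 MB = 1024 * 1024 bytes).
    return n_params * bits_per_weight / 8 / 1024 ** 2

original = model_size_mb(11.177e6, 32)   # pretrained ResNet18, 32-bit floats
compressed = model_size_mb(0.944e6, 6)   # after distillation + pruning, 6-bit
ratio = compressed / original            # compression ratio of the assembly
```

The estimate lands near 42.6 MB for the original model, about 0.68 MB for the compressed one, and a ratio of roughly 1.6%, consistent with the reported 1.67%.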
Compared to the original model, there is no drop of flare-forecasting accuracy in the compressed model. This indicates that the original model is highly redundant, with a large number of redundant parameters. After knowledge distillation and pruning, the number of parameters is reduced dramatically, to 12% of the original, while maintaining the same (or even better) performance. This can increase the interpretability of the model, leading to an easier physical interpretation of the flare mechanism. Quantization then further simplifies the model by representing the parameters with fewer bits (from 32-bit floating point to 6 bits) at almost the same performance.
From the astrophysical aspect, saliency methods (Simonyan et al. 2013) are employed to interpret why the flare-forecasting model keeps the same performance before and after compression. They reveal the relationship between the model's prediction and the regions of the input magnetogram that affect the prediction the most. We use Grad-CAM++ (Chattopadhyay et al. 2018) to visualize the regions of interest of the model before and after compression. Figure 8 visualizes the saliency maps for five selected samples. It can be observed that the compressed model has almost the same regions of interest as the original model, i.e., the peaks and neutral lines of the magnetic field, which is well in line with theoretical studies. This implicitly indicates that the compression does not change the intrinsic features of the input magnetogram used for flare forecasting. In other words, the features learned by the two models are equally identifiable to the classifier and even well aligned with respect to the saliency maps, although the compressed model has much coarser features than the original one. Moreover, the intrinsic dimension of the magnetogram is actually very small for flare forecasting: although the number and precision of the network parameters are greatly compressed, there is no significant discrepancy in the learned features regarding the saliency maps. The structural similarity index measure (SSIM) is computed to quantitatively measure the similarity between two saliency maps. In Figure 8, the SSIMs of all samples are greater than 0.95, indicating high similarity between the saliency maps of the original and compressed models. We also compute the SSIMs over all positive samples in the test set, obtaining an average SSIM of 0.8416, again indicating good similarity between the original and compressed models.
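For reference, the SSIM comparison can be sketched as follows (a simplified single-window variant computed over the whole map, rather than the standard sliding-window SSIM used by common libraries; C1 and C2 are the usual stabilizing constants):

```python
import numpy as np

def ssim_global(x, y, data_range=1.0):
    """Single-window (global) SSIM between two saliency maps.

    Statistics are computed over the whole map instead of local windows,
    which is a coarse but self-contained approximation of SSIM.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    c1 = (0.01 * data_range) ** 2     # luminance stabilizer
    c2 = (0.03 * data_range) ** 2     # contrast stabilizer
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
```

Identical maps score 1.0, while anti-correlated maps score well below 0.5, so values above 0.95 indeed indicate near-identical regions of interest.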

Discussions
At present, the physical mechanism of solar flares is still unclear. Although many physical models of solar flares have been proposed, it remains unclear what physical mechanism triggers the occurrence of flares and what factors/parameters of the magnetic field determine it. This makes it difficult to forecast solar flares accurately, so most flare-forecasting methods still rely on empirical approaches. The Carmichael–Sturrock–Hirayama–Kopp–Pneuman model (Svestka & Cliver 1992) describes the morphological structure of the magnetic field in solar flares, in which magnetic reconnection in the solar corona produces bright arcades over the polarity inversion line (PIL). Possibly, the mutual reinforcement between magnetohydrodynamic (MHD) instabilities and magnetic reconnection results in runaway energy release, accounting for solar flares. Thus, an MHD instability might be necessary for the onset of solar flares. However, the relationship between MHD instabilities and the occurrence of solar flares remains ambiguous, so a flare-forecasting model based on MHD instability has not yet been demonstrated. Ishiguro & Kusano (2017) proposed a different instability, namely the double-arc instability (DAI), which could be responsible for the initial ignition of a flare. Kusano et al. (2020) discussed a particular scenario of the DAI in which a small-scale reconnection (namely, trigger reconnection) occurs between current-carrying magnetic loops above the PIL. This trigger reconnection could cause the formation of a double-arc magnetic loop, resulting in flares. By exploring the conditions necessary for the trigger reconnection, one can estimate how close the magnetic field is to the critical state of the instability. Once the magnetic field approaches this critical state, a small-scale reconnection event could trigger a flare, indicating an imminent eruption. This conclusion led to the so-called κ-scheme flare-forecasting model (Kusano et al. 2020). Like theoretical models, deep learning-based models also focus on the PIL for flare forecasting. The visualization results of deep-learning models show that the regions of interest are mainly around the PIL; these regions have the most influence on the forecast of the neural network. However, deep-learning models cannot detect very small instability regions, possibly because they cannot extract the features of such regions given the limited resolution of the magnetograms. In the future, more advanced magnetographs will provide higher-resolution and higher-quality magnetograms, which may enable deep-learning models to extract the finer features that theoretical models have confirmed to be highly associated with solar flare eruptions.
Regarding the above analysis of saliency maps, it should be pointed out that the features learned by DNNs do not simply correspond to physical quantities, MHD instability features, or PIL features. In fact, the former are derived from the hidden space, while the latter are computed in the input image space. In addition, the intrinsic dimension of DNN features is task dependent. Here, flare forecasting, a simple binary image classification task, is not complicated relative to general image classification with subtle differences and details in features, which explains the huge network compression ratio in Table 4. The difficulty is that the mechanism of the flare is unclear, so it is not known how to design a network specifically suited to flare forecasting. Once the flare principle is uncovered, only a few physical/deep features might suffice to indicate whether a flare will occur, which would guide the design of a network tailored to flare forecasting.

Conclusion
In this paper, we investigate a scenario where deep-learning models are deployed on satellites with limited resources. Three popular compression methods, knowledge distillation, pruning, and quantization, are examined for the compression of a pretrained ResNet18 for solar flare forecasting. All of them can achieve a high compression ratio without an obvious drop in forecasting performance. First, the three compression methods are implemented individually to verify their respective feasibility. Then, an assembled compression framework that integrates the three methods and implements them successively is proposed. It compresses the pretrained ResNet18 from 42.70 MB to only 0.71 MB almost without loss of forecasting accuracy. The compressed model can be deployed on satellites with limited storage and computing resources.
Through saliency analysis, it is found that the model always focuses on the same regions of the magnetogram before and after compression, namely the peaks and neutral lines of the magnetic field, in line with theoretical studies. This implicitly indicates that the compression does not change the intrinsic features of the magnetogram used in flare prediction. In addition, the intrinsic dimension of the deep features is actually very small for flare prediction.
Moreover, knowledge distillation and pruning compress the pretrained solar flare forecasting model to 12% of its original size, simplifying the network structure dramatically. This can increase the interpretability of the model, leading to an easier physical interpretation of the flare mechanism, which will be studied in our future work.
where $x \in \mathbb{R}^{a \times b \times m}$ and $y \in \mathbb{R}^{c \times d \times n}$ are the input and output of the CNN, respectively, and L is the number of convolutional and fully connected layers. σ represents the activation function. Normalization has been absorbed into f together with the biases.
where x represents the weights, b represents the bit depth, α is the clipping threshold, and round(·) is the round-off function. The uniform quantizer maps the input x to the range [0, 2^b − 1]. The quantization procedure is detailed in Algorithm 1.

Algorithm 1. Quantization for weights. Input: weights of the pretrained flare-forecasting model W; bit depth b. Output: quantized weights W_q.

Knowledge distillation uses the original model as a supervision signal to train a new, more compact model. If pruning or quantization were applied first, the original model could be significantly compromised. Hence, knowledge distillation should be implemented before pruning and quantization. Of the three compression techniques, quantization does not require model training; otherwise, the fixed-point weights would revert to floating-point numbers. The other two methods, however, require model training. Therefore, pruning should come before quantization.

Figure 2 .
Figure 2. The framework of pruning.

Figure 4 .
Figure 4. The flow chart of model compression for solar flare forecasting. The assembled compression framework consists of a knowledge distillation, a pruning, and a quantization module, compressing a pretrained deep-learning model progressively.

Figure 5 .
Figure 5. Comparison of the ROC curves with and without fine-tuning after pruning.

Figure 8 .
Figure 8. Visualization of saliency maps with almost the same regions of interest. The first row gives the input magnetograms; the second and third rows provide saliency maps for the compressed and original models, respectively; the highlighted regions are the regions of interest that are highly relevant to solar flare forecasting.

Table 1
Data Distribution from 2010 to 2019

Table 2
Comparison of the Number of Kernels for Each Layer before and after Pruning

Table 3
The Contingency Table for Binary Solar Flare Forecasting

Table 4
Computational Complexity of the Compressed Model. Note. "FLOPs" means the number of floating-point operations, "Params" represents the total number of parameters of the network, and "Compression Rate" is the ratio between the compressed model and the pretrained model.