Application and evaluation of CNN-LSTM classification regression based multi-source precipitation data fusion model in water resources research in Qinghai Province

Qinghai Province lies on the Qinghai-Tibet Plateau, where complex and diverse topography and a sparse network of precipitation stations make reliable precipitation data difficult to obtain. This study proposes a classification and regression model based on deep learning that combines a convolutional neural network (CNN) and a long short-term memory neural network (LSTM): the CNN extracts the spatial features of multi-source data, and the LSTM captures their temporal dependencies. The classification results are used to determine whether rainfall is occurring and to further calibrate the non-rainfall component of the regression results. ERA5, IMERG, CHIRPS and DEM were selected as feature data and rain gauge data as label data. The findings indicate that the proposed CNN-LSTM classification regression model (CLCR) outperforms the other models (CNN, CNN-LSTM, LSTM). The Kling-Gupta efficiency (KGE) of the data fused using CLCR was 0.66, higher than that of the raw precipitation data (0.53, -0.36, 0.34) and of the other models (0.58, 0.65, 0.63). CLCR also showed better daily precipitation detection than the other models and the raw precipitation data, with a Critical Success Index (CSI), False Alarm Ratio (FAR), and Probability of Detection (POD) of 0.61, 0.25 and 0.76, respectively. This study generated a high-precision daily rainfall dataset with a spatial resolution of 0.01° for 2013-2017 in Qinghai Province, which provides reliable data support for hydrological studies in the province.


Introduction
Precipitation is an important part of the global water cycle and is critical for decision-making and planning in a range of domains, including hydrology, meteorology, climate and agriculture [1,2]. Accurate observation of the spatio-temporal distribution of rainfall contributes to an in-depth understanding of the evolution and regulation mechanisms of hydrological processes and provides a reliable basis for decision-making in related fields [3]. However, global precipitation measurement remains difficult because of the large spatio-temporal variability of rainfall and the constraints of specific geographic environments, especially on the Tibetan Plateau at high spatial and temporal resolutions [4].
Ground-based rain gauge data, satellite remote sensing data and radar monitoring data are the main sources of precipitation data and precipitation fusion products [4]. Ground rain gauges measure surface precipitation directly and detect precipitation accurately. However, because of the influence of terrain, the spatial distribution of rain gauges is not uniform. Rain gauges are scarce in remote areas with complex terrain, such as plateaus, mountains, and deserts, and are completely absent in uninhabited areas [5]. Generating spatially continuous precipitation fields by spatial interpolation of rain gauge data alone therefore carries large uncertainty [6]. Satellite remote sensing data have the advantages of global coverage and high spatio-temporal resolution, but their accuracy is still affected by topography, atmosphere, inversion algorithms, and other factors [7]. Radar observation of precipitation offers high spatio-temporal accuracy and a wide measurement range, but its accuracy is vulnerable to complex terrain [8]. In the last few years, several fusion approaches have been proposed to improve the spatial and temporal resolution and accuracy of rainfall data [9]. These include inverse error variance weighting, geographically weighted regression, optimal interpolation, Bayesian model averaging, and regression kriging interpolation [10].
With the growth of remote sensing data and advances in deep learning, algorithms such as CNN, LSTM and artificial neural networks (ANN) have been applied to various precipitation fusion studies. However, these models consider only the spatial dependence or only the temporal dependence of precipitation data. To this end, we propose a multi-source precipitation fusion model based on CNN and LSTM. The method is a classification and regression model built on deep learning: it uses the classification results (which determine whether it rains or not) to calibrate the non-rainfall part of the regression results, takes into account both the temporal and spatial correlation of precipitation data, and is used to generate highly accurate precipitation datasets.

Research region
Qinghai Province, with an area of about 723.3 thousand km², was selected as the research region. As depicted in Figure 1, Qinghai Province is located in northwestern China between 31°N and 40°N and between 89°E and 103°E, with an average altitude of over 3,000 m and terrain that is generally low in the southeast and high in the northwest. The mean annual rainfall across the province is 230-490 mm, but it is very unevenly distributed over the year, with a wet period from May to September and a dry period from October to April [11].
Figure 1. Geographical position of the research region and spatial distribution of rain gauges.

Data information
As shown in Table 1, four precipitation datasets were used in this study (ERA5, IMERG, CHIRPS, and rain gauge observations), together with one auxiliary dataset (DEM).

Methodology
The research methodology of this paper is depicted in Figure 2. The entire workflow can be summarized in three phases:
(1) Data preprocessing: CHIRPS, ERA5, IMERG and DEM are resampled to a spatial resolution of 0.01°, the time scale is unified to 1 day, and the training data are extracted from all datasets.
(2) Model construction: CNN and LSTM are combined to build the classification regression model (CLCR), which is compared with CNN, LSTM, and CNN-LSTM.
(3) Precipitation fusion and evaluation: the fused data are evaluated and corrected using the observed data from model training.

Fundamental neural network architectures
A CNN mainly uses convolutional and pooling layers to achieve efficient feature extraction and representation learning on its input data. In precipitation data, the precipitation at each grid point is correlated with that of the surrounding points, so a CNN can be used to extract the features of the target grid point together with its neighboring grid points. LSTM is a specialized variant of recurrent neural networks (RNNs) intended to process sequential data that exhibit long-term dependencies. Compared with the traditional RNN, LSTM introduces a gating mechanism, which mitigates the vanishing and exploding gradient problems that RNNs face in long-sequence tasks. Precipitation at any given time point is influenced by precipitation at previous time points. Therefore, LSTM can be used to extract precipitation information across different time points.

Performance evaluation
In this study, the Kling-Gupta efficiency (KGE) was selected as the evaluation criterion to quantitatively assess the performance of the proposed classification regression fusion model. KGE is a comprehensive evaluation index that combines the correlation coefficient (CC), the variability ratio (γ) and the bias ratio (β). The calculation formulas are given in equations (1), (2), (3) and (4).
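As a rough illustration of these metrics, the sketch below computes KGE in its widely used form, KGE = 1 - sqrt((CC - 1)^2 + (β - 1)^2 + (γ - 1)^2), with β = μ_s/μ_o and γ = (σ_s/μ_s)/(σ_o/μ_o), where s denotes the fused or satellite estimate and o the gauge observation. Defining γ as a ratio of coefficients of variation is an assumption about which KGE variant is meant. The sketch also computes the detection scores quoted in the abstract (CSI, POD, FAR) from the usual hit, miss and false-alarm counts. The function names and the 0.1 mm/day rain threshold are hypothetical and are not taken from the paper.

```python
import numpy as np


def kge(sim, obs):
    """Kling-Gupta efficiency (assumed variant: gamma is a ratio of CVs)."""
    sim = np.asarray(sim, dtype=float)
    obs = np.asarray(obs, dtype=float)
    cc = np.corrcoef(sim, obs)[0, 1]                  # correlation coefficient
    beta = sim.mean() / obs.mean()                    # bias ratio
    gamma = (sim.std() / sim.mean()) / (obs.std() / obs.mean())  # variability ratio
    return 1.0 - np.sqrt((cc - 1) ** 2 + (beta - 1) ** 2 + (gamma - 1) ** 2)


def detection_scores(sim, obs, threshold=0.1):
    """Return (CSI, POD, FAR) for rain / no-rain detection.

    threshold (mm/day) is an assumed wet-day cutoff. The ratios are
    undefined when there are no wet days in both series.
    """
    sim_wet = np.asarray(sim, dtype=float) >= threshold
    obs_wet = np.asarray(obs, dtype=float) >= threshold
    hits = np.sum(sim_wet & obs_wet)
    misses = np.sum(~sim_wet & obs_wet)
    false_alarms = np.sum(sim_wet & ~obs_wet)
    csi = hits / (hits + misses + false_alarms)
    pod = hits / (hits + misses)
    far = false_alarms / (hits + false_alarms)
    return float(csi), float(pod), float(far)


# Toy daily series (mm/day), for illustration only.
gauge = np.array([0.0, 1.0, 3.0, 0.0, 5.0])
fused = np.array([0.5, 2.0, 0.0, 0.0, 4.0])
print(kge(fused, gauge))
print(detection_scores(fused, gauge))
```

A perfect estimate gives KGE = 1, and higher CSI and POD with lower FAR indicate better detection.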

Conclusion
In this paper, a classification regression model based on CNN and LSTM (CLCR) is proposed to address the spatio-temporal correlation of precipitation. The model is also compared with three regression models based on CNN, LSTM and CNN-LSTM. The findings show that: (1) In Qinghai Province, the CLCR model achieves a better fusion effect than the other three fusion models and the raw precipitation data. The average KGE of the CLCR model is 0.66, CSI is 0.61, and FAR is 0.25; compared with ERA5, the best-performing raw product, the average KGE is improved by 25%, CSI is improved by 8.9%, and FAR is reduced by 36%. (2) CLCR improves the temporal and spatial accuracy of the three raw datasets. The fused precipitation data match the measured data more closely and better reflect the actual precipitation in Qinghai Province.

Figure 2. Research route.
3.1. Data processing
First, the time dimension of the different datasets needs to be unified to 1 day and the time series need to be matched (unified to UTC). Second, considering the different spatial resolutions of the datasets, the IMERG, CHIRPS, and ERA5 data were resampled to a spatial resolution of 0.01° using nearest-neighbor interpolation, and the DEM data were also resampled to 0.01°. Because the DEM data differ from the precipitation data in units and value range, which may affect the convergence speed of the model, they are normalized. Finally, a 5×5 grid centered on each rain gauge is extracted. Each dataset has dimensions (72880, 4, 5, 5), where 72880 denotes the sample size (1822 days × 40 gauge sites), 4 denotes the time dimension, and 5 denotes the height and width of the grid, as shown in Figure 3.
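The sub-grid extraction step can be sketched as follows. This is a minimal illustration and not the authors' code: the function name, the way gauge positions are passed in (row and column indices on the 0.01° grid), and the one-day sliding window used to build the 4-step time dimension are all assumptions.

```python
import numpy as np


def extract_gauge_samples(field, gauge_cells, steps=4, half=2):
    """Cut (steps, 5, 5) samples centered on each gauge cell.

    field: array of shape (days, rows, cols), one daily grid per day.
    gauge_cells: iterable of (row, col) indices of the cells holding a gauge.
    steps: length of the time dimension (4 in the text).
    half: cells on each side of the gauge, so half=2 gives a 5x5 window.
    Gauges closer than `half` cells to the grid border are not handled.
    """
    field = np.asarray(field, dtype=float)
    days = field.shape[0]
    samples = []
    for row, col in gauge_cells:
        # All days for the 5x5 neighborhood around this gauge.
        patch = field[:, row - half:row + half + 1, col - half:col + half + 1]
        # Assumed windowing: slide a window of `steps` days, one day at a time.
        for start in range(days - steps + 1):
            samples.append(patch[start:start + steps])
    return np.stack(samples)


# Toy example: 10 days on a 20x20 grid with 3 hypothetical gauge cells.
field = np.arange(10 * 20 * 20, dtype=float).reshape(10, 20, 20)
samples = extract_gauge_samples(field, [(5, 5), (10, 12), (2, 17)])
print(samples.shape)  # (21, 4, 5, 5): 7 windows per gauge x 3 gauges
```

Applying the same function to each of the four feature datasets yields one array per source with the (samples, 4, 5, 5) layout described in the text.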

Figure 3. Schematic diagram of multi-source precipitation data and DEM sub-grid data extraction.

Figure 4. CLCR network structure diagram. The ellipses represent the input data and output results, labeled with the time dimension, height and width; the rectangles represent the network layers, labeled with the network type and hyperparameters.
To fuse precipitation data and more accurately model the spatiotemporal correlation between rain gauge data and satellite precipitation data, we propose a deep neural network-based classification regression model. This model combines a CNN and an LSTM: the CNN extracts the spatial features of the precipitation data, the LSTM captures their temporal dependence, and the classification results are used to correct the final fused data. The CLCR is divided into three main parts, and the network structure and hyperparameters are shown in Figure 4:
• The first part is the convolutional layer. To extract the spatial features of the precipitation data more comprehensively, a separate CNN performs spatial feature extraction on each of the four kinds of data. The spatial features of each dataset are then flattened into one-dimensional vectors and concatenated.
• The second part is a recurrent layer that uses LSTM to extract the temporal dependence of the feature vectors generated by the CNN. The temporal dimension of the data is 4.
• The third part is the classification and regression layer, where the LSTM output at the last time step is fed into a fully connected neural network to generate the classification and regression results. In the classification result, 0 means no rainfall and 1 means rainfall. The regression result gives the specific rainfall amount.
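The way the classification output corrects the regression output can be sketched as below. This is an illustrative reading of the text and not the authors' implementation: the function name, the 0.5 probability threshold, and the clipping of negative regression values to zero are assumptions.

```python
import numpy as np


def apply_rain_mask(rain_prob, rain_amount, prob_threshold=0.5):
    """Combine the classification and regression outputs of the network.

    rain_prob: predicted probability of rainfall (classification head).
    rain_amount: predicted rainfall amount in mm/day (regression head).
    prob_threshold: assumed cutoff for turning a probability into class 0 or 1.
    """
    rain_prob = np.asarray(rain_prob, dtype=float)
    rain_amount = np.asarray(rain_amount, dtype=float)
    is_rain = rain_prob >= prob_threshold          # class 1 = rainfall
    # Assumed safeguard: rainfall amounts cannot be negative.
    amount = np.maximum(rain_amount, 0.0)
    # Where the classifier says "no rainfall", force the fused value to 0.
    return np.where(is_rain, amount, 0.0)


fused = apply_rain_mask([0.9, 0.2, 0.6], [3.2, 0.4, -0.1])
print(fused.tolist())  # [3.2, 0.0, 0.0]
```

In this reading, the classifier only suppresses small spurious amounts on dry days, and the regression output is used unchanged on wet days.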

Figure 5. (a) KGE of the multi-source data in different seasons; (b) CSI of the multi-source data in different seasons.

Table 1. Data information.