Predicting Gross Primary Productivity of the Forest Ecosystems using Machine Learning Techniques: A Review of Existing Approaches

Photosynthesis is a biotic process in which plants assimilate atmospheric CO2 into sugar molecules in the presence of solar energy. The carbon uptake by plants in this process is defined as gross primary productivity (GPP). Part of this assimilated carbon is used by the plants to support their physiological activities, a process defined as respiration. The sequestration of carbon by terrestrial ecosystems is a vital element of Earth’s carbon cycle and constitutes a major sink for climate change mitigation. The crop yield of an agricultural field is directly linked to its GPP, which makes GPP important for food security and the economy. Hence, quantifying the GPP of terrestrial ecosystems is an active branch of study, and several methods have been used to address it. In recent times, machine learning (ML) methods, which draw on the benefits of artificial intelligence (AI), have gained increased interest and are being applied to a range of scientific and technological problems. In addition to the traditional methods, several ML techniques have been explored for GPP estimation. Studies have shown that ML models can predict GPP more accurately than traditional methods. Because the field is developing rapidly, a comprehensive review of these methods will be helpful to researchers. This paper offers a comprehensive analysis of existing ML techniques for estimating GPP, providing a comparative review of their effectiveness.


Introduction
Gross primary productivity (GPP) denotes the total uptake of carbon by primary producers such as plants through photosynthesis [1]. As part of their respiration, plants release some of this assimilated carbon back into the atmosphere to derive energy for physiological activities such as tissue growth and maintenance, and the remaining carbon is referred to as the net primary productivity (NPP) [2]. Accurate estimates of GPP and NPP, and of their seasonal and interannual variations, are crucial for understanding the dynamics of the global carbon cycle and the ability of the environment to withstand climate variations [3]. The global biosphere-atmosphere CO2 flux is dynamic in nature and controlled by environmental conditions and the underlying vegetation type [4][5][6]. Predicting GPP across time and space is difficult because of its spatio-temporal variability, which is controlled by various biophysical and meteorological factors, viz. temperature, rainfall, humidity, solar radiation, vegetation type and canopy characteristics [7]. Forests play a significant role as a substantial reservoir for sequestering atmospheric CO2 [8]. Insight into the carbon exchange between forests and the atmosphere is of utmost importance for forecasting the global trajectory of atmospheric CO2 and for implementing sustainable development strategies [9]. Owing to its crucial impact on global climate and its socio-economic significance, the observation and simulation of GPP have received keen interest in the research community.
Several global and regional networks of flux towers using the eddy covariance technique exist for continual monitoring of ecosystem-atmosphere exchanges of carbon, water and energy, such as Fluxnet [10], AmeriFlux [11], AsiaFlux [12] and ICOS [13]. Over the Indian region, two such major initiatives are MetFlux-India [5] by the Ministry of Earth Sciences (MoES) and the geosphere-biosphere program (GBP) [6] by the Indian Space Research Organisation (ISRO). The carbon flux measured at a flux tower is termed the net ecosystem exchange (NEE), which is subsequently partitioned into GPP and respiration using statistical regression tools [14]. Such measurements have a limited footprint that depends on the environmental, topographical and turbulence characteristics of the site and, because they require stringent measurement conditions, often suffer from substantial data loss [15]. Additionally, the number of flux towers over India is not sufficient to comprehensively map the GPP of its highly biodiverse forests, which are spread over a wide geoclimatic span [16]. In addition to these in situ measurements, researchers have also attempted to estimate GPP from space. The Moderate Resolution Imaging Spectroradiometer (MODIS) by NASA, USA [17] has been providing space-based estimates of GPP for more than two decades [18]. The different MODIS GPP products, which differ in spatio-temporal resolution and underlying algorithm, are commonly used in studies of the carbon cycle, climate change and ecosystem health monitoring [19,20]. MODIS provides a way to monitor GPP from space continuously in both space and time. However, it is critical to validate the remotely sensed temporal dynamics of carbon and water fluxes against those directly calculated from surface networks or modelled using calibrated physiological models with ground data as input [21].
Big data and artificial intelligence are two rapidly evolving technologies. Data-driven machine learning (ML) is a significant component of contemporary intelligent technology. ML models can exploit extensive remote sensing data to generate precise estimates of gross primary productivity over wide geographical extents and extended periods [22]. Studies also suggest that ML approaches such as random forest show promise in enhancing the accuracy of GPP predictions [23]. In this study, we examine the available research on estimating and predicting GPP using ML models. It covers studies using data sources such as FLUXNET 2015, MODIS and Google Earth Engine (GEE), and ML approaches such as random forest and neural networks.

Relevant Studies
This section lists numerous studies from around the world on the prediction of GPP using ML techniques. Table 1 describes the selected studies.

The GPP predictions displayed strong performance, with a high R² score of 0.84, an RMSE of 1.29 g C m⁻² day⁻¹ and an MAE of 0.92 g C m⁻² day⁻¹ [28].

Maize field in Northwest China
The R² score for prediction of GPP was greater than 0.7 using ML methods [35].

Methodologies and Datasets used
To analyse and predict GPP, researchers have used different datasets and methodologies. Some of the widely used datasets in the above studies are discussed below:
half-hour to one year. The primary objective is to estimate the annual carbon flux, and thus the authors specifically chose the annual mean NEE and GPP values obtained using both the daytime and night-time approaches as the standard data for method evaluation and validation [9].
3.1.4. GEE Data. Google Earth Engine (GEE) is a robust cloud-based platform known for its vast library of open-source geographic datasets, amounting to petabytes of data, along with robust computation and analysis capabilities [37]. The authors of [9] selected seven parameters: NDVI, EVI, evapotranspiration, LSTD, LSTN, precipitation and forest type. One limitation of the GEE platform is the non-availability of several key variables, such as atmospheric CO2 concentration and soil properties, which limits the progress and applicability of the models and products built on this platform [9]. GEE enables the acquisition of remote sensing data for numerous applications, which supports the development of a high-precision GPP prediction approach [25].
3.1.5. PERSIANN-CDR. This dataset, which contains worldwide daily precipitation information, is obtained using an artificial neural network (ANN) for precipitation estimation from remote sensing data [38]. PERSIANN-CDR employs an ANN algorithm to process GridSat-B1 infrared data, providing a high-resolution global precipitation product with a spatial resolution of 0.25 degrees [25].

In the random forest regression (RFR) model, f(y) denotes the model prediction, which is obtained from the individual tree predictions [30].
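As a hedged illustration of how an RFR prediction can be built from individual tree predictions, the sketch below averages the outputs of trees trained on bootstrap resamples. It is a minimal sketch, not the method of [30]: the one-split trees, the helper names (`fit_stump`, `fit_forest`, `forest_predict`) and the toy EVI/GPP values are all assumptions made for illustration, and a full random forest would also subsample features and grow deeper trees.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression tree on a single feature (least squares)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # the largest value cannot split
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        sse = (sum((y - mean(left)) ** 2 for y in left)
               + sum((y - mean(right)) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, mean(left), mean(right))
    if best is None:  # only one distinct x value: predict the mean target
        constant = mean(ys)
        return lambda x: constant
    _, threshold, left_value, right_value = best
    return lambda x: left_value if x <= threshold else right_value


def fit_forest(xs, ys, n_trees=25, seed=0):
    """Train each tree on its own bootstrap resample of the data."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in xs]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees


def forest_predict(trees, x):
    """Ensemble prediction: the mean of the individual tree predictions."""
    return mean(tree(x) for tree in trees)


# Hypothetical toy data: one predictor (a vegetation index) and GPP values.
evi = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80]
gpp = [1.0, 1.4, 2.1, 3.0, 4.2, 5.1, 6.3, 7.0]
trees = fit_forest(evi, gpp)
print(round(forest_predict(trees, 0.65), 2))
```

Averaging over trees trained on different resamples is what reduces the variance of the ensemble relative to any single tree.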

Support Vector Regression (SVR)
SVR is a regression technique built on the principle of optimal classification: to achieve the best classification performance, a maximum-margin hyperplane is established by maximizing the separation between different data clusters, using non-linear kernel functions such as the polynomial and Gaussian functions [17]. SVM seeks a hyperplane characterized by its direction, denoted z, and an offset scalar, denoted m (i.e., y = z·x + m), that separates the training data according to their labels. Where the original data are not linearly separable, SVM uses a kernel function to transform the data from the original space into a higher-dimensional feature space, thereby enhancing linear separability [29]. The SVM algorithm was extended to regression problems, leading to the development of SVR. In SVR, given a set of training data and their corresponding target values {(x1, y1), (x2, y2), …, (xn, yn)}, a hyperplane in a high-dimensional feature space is used to capture the characteristics of the training data. The fit is expressed as y = z·φ(x) + m, where φ represents the kernel function [39]. SVM was selected for its favorable attributes, such as its simplicity, ability to perform global optimization and robust predictive capability [31].
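The two kernel families named above, and the way a kernel enters the fitted function, can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the cited studies: the kernel parameters (`degree`, `coef0`, `gamma`) are arbitrary, `svr_predict` uses the standard kernel-expansion form of a trained SVR, and the support vector and coefficient in the usage lines are hypothetical values, not the result of training.

```python
import math


def polynomial_kernel(u, v, degree=2, coef0=1.0):
    """Polynomial kernel: (u . v + coef0) ** degree."""
    dot = sum(a * b for a, b in zip(u, v))
    return (dot + coef0) ** degree


def gaussian_kernel(u, v, gamma=0.5):
    """Gaussian (RBF) kernel: exp(-gamma * ||u - v||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


def svr_predict(x, support_vectors, coefficients, offset, kernel):
    """Kernel expansion of the fitted function: sum_i c_i * K(x_i, x) + m."""
    weighted = sum(c * kernel(sv, x)
                   for sv, c in zip(support_vectors, coefficients))
    return weighted + offset


# Hypothetical, untrained values used only to exercise the functions.
support_vectors = [(0.4, 20.0)]
coefficients = [2.0]
print(svr_predict((0.4, 20.0), support_vectors, coefficients, 0.5,
                  gaussian_kernel))
```

In practice the coefficients and the offset come from solving the SVR optimization problem; only the prediction step is shown here.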

K Nearest Neighbor Regression (KNR)
The KNR approach is a simple yet effective machine learning technique that uses a collection of available instances to predict numerical targets on the basis of their similarity, measured through distance metrics [32]. In the KNR technique, the numerical targets of the K nearest neighbors (KNN) are averaged. As in KNN, KNR selects neighbors based on distance. Some of the commonly used distance functions are defined below:
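The distance definitions themselves are not reproduced in this section. As an illustration only, the sketch below assumes two widely used choices, the Euclidean and Manhattan distances, and shows how KNR averages the targets of the K nearest training points. The function names and the toy feature/GPP values are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """Manhattan distance: sum of the absolute differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def knr_predict(train_x, train_y, query, k=3, distance=euclidean):
    """Average the targets of the k training points nearest to the query."""
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: distance(pair[0], query))
    nearest = ranked[:k]
    return sum(y for _, y in nearest) / len(nearest)


# Hypothetical training set: (vegetation index, temperature) -> GPP.
train_x = [(0.2, 15.0), (0.3, 18.0), (0.5, 22.0), (0.6, 25.0), (0.8, 28.0)]
train_y = [1.5, 2.0, 4.0, 5.0, 7.0]
print(knr_predict(train_x, train_y, (0.55, 23.0), k=2))  # 4.5
```

Because the two features here are on very different scales, the temperature dominates both distances; normalizing the inputs first is usually advisable.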

Artificial Neural Networks (ANN)
ANNs simulate the workings of biological neural systems, comprising an input layer that handles the explanatory variables, several hidden layers that perform complex non-linear computations, and an output layer that generates the desired outcome [40]. The network's weights and biases are optimized by minimizing a cost function that measures the discrepancy between the true labels and the predicted values. To achieve this, hyperparameter optimization was performed in R, with settings such as 2 hidden units, a maximum of 100 iterations, and the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method, which dynamically adjusts the learning rate [17]. In one of the studies, the ANN is a fundamental neural network architecture characterized by direct connections between layers [29].
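To make the roles of the layers and the cost function concrete, the following minimal sketch computes the forward pass of a network with a single hidden layer of two units and evaluates a mean squared error cost. It is an assumption-laden illustration, not the R configuration of [17]: the tanh activation, the single hidden layer, the function names and the weight values are all chosen here for brevity, and no optimizer such as BFGS is implemented.

```python
import math


def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Input layer -> tanh hidden layer -> single linear output unit."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(weights, x)) + b)
              for weights, b in zip(w_hidden, b_hidden)]
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out


def mse_cost(inputs, targets, params):
    """Mean squared discrepancy between true and predicted values."""
    errors = [(forward(x, *params) - t) ** 2
              for x, t in zip(inputs, targets)]
    return sum(errors) / len(errors)


# Hypothetical weights for 2 inputs and 2 hidden units.
params = (
    [[0.5, -0.1], [0.2, 0.3]],  # hidden-layer weights, one row per unit
    [0.0, 0.1],                 # hidden-layer biases
    [1.5, 2.0],                 # output weights
    0.5,                        # output bias
)
inputs = [[0.4, 0.7], [0.6, 0.2]]
targets = [2.0, 2.5]
print(mse_cost(inputs, targets, params))
```

Training would then consist of adjusting `params` to reduce the value returned by `mse_cost`.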

Gradient Boost Regression (GBR) Tree
Gradient boosting is a versatile, data-driven statistical learning approach that effectively addresses classification and regression tasks [41], and it is known for its efficacy in predictive analysis. The boosted trees model is an additive approach in which predictions are generated by combining the outcomes of a set of base models [42]. Models in this category can be formulated as:
t(y) = h0(y) + h1(y) + h2(y) + … (4)
In boosted tree models, the final model, denoted t, is obtained by aggregating multiple base models (hi). These base models, which are simple decision trees, are combined through a process known as model ensembling, in which multiple models are used to obtain enhanced predictive performance [42]. In contrast to RF, which builds each base model separately (each using a subsample of the data), GBR uses a specific model ensembling method termed gradient boosting.
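The additive structure of boosted trees can be sketched for squared-error loss, where each new tree is fitted to the residuals of the current ensemble. This is a minimal sketch under assumptions, not the configuration used in the reviewed studies: the base models are one-split trees, a shrinkage factor (`learning_rate`) scales every tree after the constant first term, and the function names and toy data are hypothetical.

```python
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression tree on a single feature (least squares)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # the largest value cannot split
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        sse = (sum((y - mean(left)) ** 2 for y in left)
               + sum((y - mean(right)) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, mean(left), mean(right))
    if best is None:  # only one distinct x value: predict the mean target
        constant = mean(ys)
        return lambda x: constant
    _, threshold, left_value, right_value = best
    return lambda x: left_value if x <= threshold else right_value


def fit_gbr(xs, ys, n_rounds=50, learning_rate=0.1):
    """Start from a constant model, then add stumps fitted to residuals."""
    initial = mean(ys)
    stumps = []
    predictions = [initial] * len(xs)
    for _ in range(n_rounds):
        # For squared error, the negative gradient equals the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x)
                       for p, x in zip(predictions, xs)]
    return initial, stumps, learning_rate


def gbr_predict(model, x):
    """Additive prediction: constant term plus the scaled stump outputs."""
    initial, stumps, learning_rate = model
    return initial + learning_rate * sum(stump(x) for stump in stumps)


# Hypothetical toy data: one predictor (a vegetation index) and GPP values.
evi = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80]
gpp = [1.0, 1.4, 2.1, 3.0, 4.2, 5.1, 6.3, 7.0]
model = fit_gbr(evi, gpp)
print(round(gbr_predict(model, 0.65), 2))
```

Unlike the independent trees of a random forest, each stump here depends on all the stumps fitted before it.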

Figure 1. Basic architecture for prediction of GPP using AI/ML models.
Figure 1 represents the basic architecture for prediction of GPP using AI or ML models, which involves several components. The first component is the data, which include meteorological, biophysical and topographical features along with the corresponding GPP values. The data are then pre-processed through steps such as cleaning, resampling and normalization. Subsequently, the processed data are partitioned into training, validation and test sets. The ML/AI models are then trained on the training set and evaluated on the validation and test sets. The predicted GPP values are obtained from the trained models. Finally, evaluation metrics are used to gauge the performance of the models and ascertain their accuracy and reliability, providing insight into their effectiveness in predicting GPP.
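The workflow described above can be sketched end to end on synthetic data. Everything in this example is an assumption made for illustration: the data are randomly generated, the model is a one-predictor least-squares line standing in for any ML model, the 60/20/20 split is arbitrary, and the metrics shown (R², RMSE, MAE) are the ones reported by the studies reviewed here. For brevity the normalization is fitted on the full dataset, whereas in practice it would be fitted on the training set only.

```python
import math
import random


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def split(rows, train=0.6, val=0.2, seed=0):
    """Shuffle and partition rows into training, validation and test sets."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_train = round(len(rows) * train)
    n_val = round(len(rows) * val)
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]


def fit_line(rows):
    """Ordinary least squares for one predictor: gpp ~ a * x + b."""
    xs, ys = zip(*rows)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    a = (sum((x - mx) * (y - my) for x, y in rows)
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx


def evaluate(rows, model):
    """Return R2, RMSE and MAE of the model on the given rows."""
    a, b = model
    ys = [y for _, y in rows]
    preds = [a * x + b for x, _ in rows]
    n = len(rows)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    mean_y = sum(ys) / n
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return {
        "r2": 1 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(y - p) for y, p in zip(ys, preds)) / n,
    }


# Synthetic stand-in for a predictor and tower-based GPP (hypothetical).
rng = random.Random(1)
raw_x = [rng.uniform(0.1, 0.9) for _ in range(50)]
gpp = [8.0 * x + rng.gauss(0.0, 0.3) for x in raw_x]

rows = list(zip(min_max_normalize(raw_x), gpp))
train_rows, val_rows, test_rows = split(rows)
model = fit_line(train_rows)
print("validation:", evaluate(val_rows, model))
print("test:", evaluate(test_rows, model))
```

Keeping the test set untouched until the final evaluation is what makes the reported metrics an honest estimate of performance on unseen data.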

Discussion
Based on our study, we conclude that accurate prediction of GPP under seasonal variability from traditional spectral or vegetation indices, tasseled cap transformations and spectral bands is essential, as it plays a critical role in ecosystem science by shaping our understanding of ecological aspects such as productivity, carbon cycling, nutrient distribution and the preservation of biological diversity [24]. The use of multi-source data with ML techniques presents a novel avenue for investigating the intricate carbon dynamics of terrestrial ecosystems [30]. With the development of deep convolutional neural networks, precise estimates of carbon variables in forest ecosystems can be acquired conveniently, along with insight into the mechanisms governing these variables [9]. Extensive research has demonstrated that ML models can achieve higher precision in predicting GPP than traditional methods [25]. Including Köppen climate classification data as an additional independent variable in the random forest methodology has been shown to greatly enhance model performance, as it captures the temporal dynamics of vegetation throughout the seasons and leads to improved GPP predictions [26]. The RFR model emerges as a robust and efficient modeling tool that can estimate, and potentially calibrate, the MODIS GPP product with precision [27]. Given the responsiveness of GPP to variations in atmospheric CO2 levels and climate conditions, predicting terrestrial GPP by combining remotely sensed biophysical attributes with meteorological and topographical factors is an intriguing endeavor. Such an investigation not only sheds light on the influence of various ML models on the uncertainty of GPP estimates but also enables the RFR model to be used to infer GPP values for both present and future periods under diverse climate scenarios [28].
Integrating earth observation data and in situ data through ML algorithms has great potential to produce precise and timely forecasts of GPP. This integrated approach holds promise for regional to global ecosystem management services and serves as a valuable complement to established GPP estimation techniques. In addition, given the precision of the predictions and the spatio-temporal scope involved, SVM-based GPP prediction can contribute to the documentation of hydro-ecological models and to the improvement of ecological modeling of terrestrial ecosystems from local to continental scales [43]. Compared with traditional AI models, deep learning methods such as DNN offer a more advanced approach that demonstrates superior performance in assessment measures and in the depiction of temporal variations, making them a preferred choice [17]. Deep learning methods have a distinct advantage in their ability to handle compressed features at both spatial and temporal scales, a task with which conventional mathematical models often struggle [44]. The LSTM model, for example, excels at capturing long-term system states for making predictions. However, describing the complex correlation between GPP and the environmental factors that influence plant growth may not require a complex temporal analysis, and using an LSTM model to forecast GPP could introduce unnecessary complexity, demanding a larger volume of historical data and more intricate parameter configurations [29].

Conclusion
This paper presents a comprehensive analysis of the machine learning (ML) approaches used for estimating gross primary productivity from flux tower measurements, MODIS remote sensing data, GEE data and other sources. It suggests that ML and deep learning (DL) techniques can significantly enhance the accuracy and precision of GPP estimation. Given their reliance on data, ML and DL models hold significant promise for unraveling the poorly understood mechanisms underlying forest carbon absorption. By capturing and modeling these mechanisms, these data-driven approaches have the potential to provide valuable insights and accurate estimates, shedding light on previously unknown aspects of carbon absorption in forests. In regression prediction, a prediction technique for GPP could be developed by combining the advantages of ML with the unique attributes of carbon exchange information. To gain comprehensive insight into global carbon dynamics, it is worthwhile to use FLUXNET and satellite observations such as MODIS with ML models for the simulation of GPP in ecosystem types such as grasslands and forests. AI models can solve complex non-linear problems and have demonstrated potential for application in forest ecological contexts. Our study can be helpful to environmentalists and researchers working on GPP prediction and estimation.
In the future, advances in machine learning and deep learning can improve the accuracy of GPP estimates. Extending these models to diverse ecosystems such as grasslands and wetlands would provide a more comprehensive understanding of global carbon dynamics. Integrating high-resolution remote sensing data and continuously expanding the available datasets can further improve GPP predictions. Furthermore, environmental researchers can use data-driven methods to develop new approaches to GPP estimation for climate change mitigation and ecosystem management.
IOP Publishing, doi:10.1088/1755-1315/1285/1/012014

A Report of the Ministry of Earth Sciences (MoES), Government of India ed J and G C and M M and K A and C S Krishnan R. and Sanjay (Singapore: Springer Singapore) pp 73-92
[17] Lee B, Kim N, Kim E S, Jang K, Kang M, Lim J H, Cho J and Lee Y 2020 Primary productivity in the forests of South Korea using satellite remote sensing data Forests 11
[18] Justice C O, Townshend J R G, Vermote E F, Masuoka E, Wolfe R E, Saleous N, Roy D P and Morisette J T 2002 An overview of MODIS Land data processing and product status Remote Sens Environ 83 3-15
[19] Plummer S 2006 On validation of the MODIS gross primary production product IEEE Transactions on Geoscience and Remote Sensing 44 1936-8
[20] Tao J, Mishra D R, Cotten D L, O'Connell J, Leclerc M, Nahrawi H B, Zhang G and Pahari R 2018 A Comparison between the MODIS product (MOD17A2) and a tide-robust empirical GPP model evaluated in a Georgia Wetland Remote Sens (Basel) 10
[21] Coops N C, Ferster C J, Waring R H and Nightingale J 2009 Comparison of three models for predicting gross primary production across and within forested ecoregions in the contiguous United States Remote Sens Environ 113 680-90
[22] Zheng Y and Takeuchi W 2022 Estimating mangrove forest gross primary production by quantifying environmental stressors in the coastal area Sci Rep 12
[23] Tian Z, Yi C, Fu Y, Kutter E, Krakauer N Y, Fang W, Zhang Q and Luo H 2023 Fusion of multiple models for improving gross primary production estimation with eddy covariance data based on machine learning J Geophys Res Biogeosci
[24] Bandopadhyay S, Pal L and Das R D 2021 Predicting gross primary productivity and PsnNet over a mixed ecosystem under tropical seasonal variability: a comparative study between different machine learning models and correlation-based statistical approaches J Appl Remote Sens 15
[25] Zhang K, Liu N, Chen Y and Gao S 2019 Comparison of different machine learning method for GPP estimation using remote sensing data IOP Conference Series: Materials Science and Engineering vol 490 (Institute of Physics Publishing)
[26] Wei S, Yi C, Fang W and Hendrey G 2017 A global study of GPP focusing on light-use efficiency in a random forest regression model Ecosphere 8
[27] Duan Z, Yang Y, Zhou S, Gao Z, Zong L, Fan S and Yin J 2021 Estimating gross primary productivity (GPP) over rice-wheat-rotation croplands by using the random forest model and eddy covariance measurements: Upscaling and comparison with the MODIS product Remote Sens (Basel) 13
[28] Prakash Sarkar D, Uma Shankar B and Ranjan Parida B 2022 Machine learning approach to predict terrestrial gross primary productivity using topographical and remote sensing data Ecol Inform 70
[29] Guo H, Zhou X, Dong Y, Wang Y and Li S 2023 On the use of machine learning methods to improve the estimation of gross primary productivity of maize field with drip irrigation Ecol Modell 476
[30] Chen Y, Shen W, Gao S, Zhang K, Wang J and Huang N 2019 Estimating deciduous broadleaf forest gross primary productivity by remote sensing data using a random forest regression model J Appl Remote Sens 13 1
[31] Yang F, Ichii K, White M A, Hashimoto H, Michaelis A R, Votava P, Zhu A X, Huete A, Running S W and Nemani R R 2007 Developing a continental-scale measure of gross primary production by combining MODIS and AmeriFlux data through Support Vector Machine approach Remote Sens Environ 110 109-22
[32] Nathaniel J, Liu J and Gentine P 2023 MetaFlux: Meta-learning global carbon fluxes from sparse spatiotemporal observations Sci Data 10 440
[33] Wang H, Shao W, Hu Y, Cao W and Zhang Y 2023 Assessment of Six Machine Learning Methods for Predicting Gross Primary Productivity in Grassland Remote Sens (Basel) 15 3475
[34] Yu T, Zhang Q and Sun R 2021 Comparison of Machine Learning Methods to Up-Scale Gross Primary Production Remote Sens (Basel) 13 2448
[35] Tramontana G, Jung M, Schwalm C R, Ichii K, Camps-Valls G, Ráduly B, Reichstein M, Arain M A, Cescatti A, Kiely G, Merbold L, Serrano-Ortiz P, Sickert S, Wolf S and Papale D 2016 Predicting carbon

Table 1. An overview of the previous studies for prediction of GPP using ML approaches.
3.1. Datasets
3.1.1. MODIS Data. MODIS, which is installed on NASA's Earth Observing System satellites, provides MOD17A2H, a unique product for monitoring terrestrial photosynthetic activity, and has been delivering worldwide GPP and PsnNet products since the year 2000 [24]. The remote sensing/satellite imagery data used in [25], comprising EVI/NDVI, temperature and land cover type, were taken from MODIS MCD43A4, MOD11A2 and MCD12Q1, respectively. The EVI data can be acquired from the MODIS MOD13C2 product, while land cover information or plant functional type was determined from the MODIS MOD12Q1 product [26]. The NDVI, FPAR and LAI, and GPP data used in [27] were taken from MOD13Q1, MOD15A2H and MOD17A2H, respectively.
3.1.3. FLUXNET 2015. To obtain field estimates of GPP and NEE for model training, the researchers used the most recent release of FLUXNET 2015. This comprehensive dataset comprises records from 212 locations spanning 1500 site-years, capturing sequential variations ranging from