Impact of large kernel size on yield prediction: a case study of corn yield prediction with SEDLA in the U.S. Corn Belt

Predicting agricultural yields is imperative for effective planning to sustain the growing global population. Traditionally, regression-based, simulation-based, and hybrid methods were employed for yield prediction. In recent times, there has been a notable shift towards the adoption of Machine Learning (ML) methods, with Deep Learning (DL), particularly Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks, emerging as popular choices for their enhanced predictive accuracy. This research introduces a cost-effective DL architecture tailored for corn yield prediction, considering computational efficiency in processing time, data size, and NN architecture complexity. The proposed architecture, named SEDLA (Simple and Efficient Deep Learning Architecture), leverages the spatial and temporal learning capabilities of CNNs and LSTMs, respectively, with a unique emphasis on exploring the impact of kernel size in CNNs. Simultaneously, the study aims to exclusively employ satellite and yield data, strategically minimizing input variables to enhance the model’s simplicity and efficiency. Notably, the study demonstrates that employing larger kernel sizes in CNNs, especially when processing histogram-based Surface Reflectance (SR) and Land Surface Temperature (LST) data from the Moderate Resolution Imaging Spectroradiometer (MODIS), allows for a reduction in the number of hidden layers. The efficacy of the architecture was evaluated through extensive testing on corn yield prediction across 13 states in the United States (U.S.) Corn Belt at the county level. The experimental results showcase the superiority of the proposed architecture, achieving a Mean Absolute Percentage Error (MAPE) of 6.71 and a Root Mean Square Error (RMSE) of 14.34, utilizing a single-layer CNN with a 15 × 15 kernel in conjunction with LSTM.
These outcomes surpass existing benchmarks in the literature, affirming the efficacy and potential of the suggested DL framework for accurate and efficient crop yield predictions.


Introduction
Crop yield prediction is of great importance for the formulation of national and international agricultural and food policies towards sustainable agriculture worldwide. Yield prediction also plays an important role in crop management, helping agricultural and socioeconomic decision-makers and examining the effects of climate change on food security [1].
Early studies of agricultural yield prediction were based on conventional experience-based agricultural research. These studies were often derived from statistical analysis and lacked relevant biological and physical principles [2]. The methods based on correlation and regression analysis, which were initially used in yield prediction, were replaced by mathematical models expressing the biological and physical characteristics of plants over time. Integrating these mathematical models with soil, weather, climate, management practices, and similar data became possible with the advances in computers and the ease of accessing them in the late 1990s [2]. Since then, research in the area has focused on utilizing different models and applying basic statistical evaluations. Starting in the early 2010s, ML-based methods were applied to yield prediction [3]. In 2015, the first DL methods were applied to yield prediction [4].
Various methods and models have been developed to predict crop yield, ranging from purely statistical methods to agro-geo-meteorological approaches, and from empirical bio-physics-based methods to mechanistic simulation models. Although these methods and models have different complexities and input data requirements, they can be classified into two different types from the operational perspective in the literature: empirical methods and dynamic methods [1,5]. Empirical methods cover all flavors of correlative, statistical, and regression methods. These are generally expressed by regression equations using a few parameters that are assumed to be independent of each other [1,5]. In one of the significant articles on regression-based yield prediction, Makowski and Michel described a simple state-space model to analyze the yield time series for several crops. They proposed a dynamic linear regression model for predicting yield using a Kalman filter-based algorithm [6]. Shastry et al compared different regression techniques such as quadratic, pure-quadratic, interactions, and polynomials for the prediction of wheat, maize, and cotton yields [7]. In another article by Michel and Makowski, eight different regression methods, including Holt-Winters with and without trend, dynamic linear models with and without trend, linear regression, quadratic regression, and cubic regression, were compared for wheat yield prediction [8].
Dynamic methods are also known as process-based methods, and CERES-Rice, CERES-Wheat 2.0, MACROS, SIRIUS, SUCROS, STICS, and WOFOST can be considered the best examples. These approaches model the progression of crop growth over time by employing intricate equations and an extensive array of input parameters, encompassing information related to soil composition, atmospheric conditions, weather patterns, plant genotype, and agricultural management practices [1,9]. Although these methods perform better since various types of inputs are fused, the acquisition of the required data is feasible only at smaller scales such as the farm level. Data acquisition poses an essential limitation when larger-scale areas are considered because of the cost of producing such high-resolution and accurate data at that scale [1,10,11].
Remote sensing information, particularly multispectral satellite imagery, holds crucial data for assessing vegetation growth. These images encompass visible, infrared, and other bands characterized by high spatial and temporal resolutions. The significance of satellite data lies in its global accessibility, cost-effectiveness, and ease of application for large-scale crop prediction [12]. Initially, vegetation indexes such as the Normalized Difference Vegetation Index (NDVI) [13], the two-band Enhanced Vegetation Index (EVI2) [14], and the Normalized Difference Water Index (NDWI) [15] were used for forecasting crop yields. These indexes are computed from the raw data of certain bands. Other bands that may carry important information on vegetation were often left out for the sake of simplicity and because of insufficient computing power and algorithms.
With the advances in computer technology, ML techniques for crop yield prediction became applicable and feasible. Support Vector Machine (SVM) [4,16], Decision Trees [16], Multi-Layer Perceptron (MLP) [16], and Restricted Boltzmann Machine (RBM) [4] are some of the most frequently used ML techniques since the beginning of this trend [17]. Since 2015, the first preliminary studies on yield prediction with deep learning have begun to emerge in the form of non-standard special neural networks [4]. To the best of our knowledge, the first standard DL approaches such as CNN and LSTM networks were used for crop yield prediction by You et al [12].

Literature survey
In 2015, Kuwata and Shibasaki proposed DL and Support Vector Regression (SVR) methods for predicting corn yield in the U.S. state of Illinois at the county level. They used MODIS EVI smoothed by wavelet transformation and Climate Research Unit (CRU) monthly maximum temperature (Tmax), minimum temperature (Tmin), potential evaporation, and pressure as inputs [4].
In 2017, You et al proposed CNN and LSTM approaches for county-level soybean yield prediction. They used histograms instead of raw images, including SR, LST, and land cover type derived from the MODIS satellite. They also incorporated a Gaussian Process (GP) layer on top of their CNN and LSTM architectures [12].
In 2018, Russello suggested a 3-dimensional (3D) CNN model for soybean yield prediction. She used MODIS SR, LST, and land cover data to leverage spatial, spectral, and temporal dimensions [17]. Wang et al proposed LSTM for soybean yield prediction in Argentina and Brazil. They used histogram representations of MODIS SR, LST, and land cover data. They also stated that a transfer learning (TL) approach improves the results [18].
In 2019, Mu et al proposed a CNN network structure for the prediction of winter wheat yield. They used SR, LST, NDVI, EVI, FPAR, LAI, ET, Average Latent Heat Flux (LE), Total Potential Evapotranspiration (PET), Average Potential Latent Heat Flux (PLE), GPP, and Net Photosynthesis (PsnNet) derived from MODIS. They applied histogram dimensionality reduction and time-series fusion before feeding the CNN network [19]. Kaneko et al used an LSTM-based network architecture and histograms as a dimensionality reduction technique to predict maize yield in six different African countries. They used MODIS SR and LST data as inputs to their network [20]. Sun et al proposed a deep CNN-LSTM model for predicting soybean yield in 15 states of the Continental United States (CONUS) at the county level. They used MODIS SR and LST data and Daymet weather data as inputs [21].
In 2020, Wang et al proposed a two-branch DL model to predict winter wheat yield in China at the county level. The first branch was based on LSTM, with inputs from meteorological data and Advanced Very High-Resolution Radiometer (AVHRR) SR data, and the second branch was based on CNN with soil data [22]. Sharma et al proposed a CNN-LSTM architecture for predicting wheat yield in India at the tehsil level. They fed raw MODIS images, including SR, LST, and land cover data, to the CNN architecture to extract relevant features and then used LSTM to determine temporal relationships [23]. Ghazaryan et al tested RF, 3D CNN, and CNN followed by LSTM for corn and soybean yield prediction in the U.S. at both county and field scales. At the county level, they used MODIS-based SR, LST, and ET time series as input datasets. At the field level, they used NDVI and the Normalized Difference Moisture Index (NDMI) computed from NASA's Harmonized Landsat Sentinel-2 (HLS) product. They also used the Cropland Data Layer (CDL) for masking corn and soybeans [27].
In 2021, Ma et al suggested that traditional DL methods are easily prone to overfitting when the number of samples in the training set is small, and they proposed a Bayesian Neural Network (BNN) for predicting county-level corn yield in the U.S. Corn Belt to overcome overfitting. They used EVI, Green Chlorophyll Index (GCI), NDWI, and daytime and nighttime LST derived from MODIS, Parameter-elevation Regressions on Independent Slopes Model (PRISM) climate data, and Soil Survey Geographic database (SSURGO) soil data as inputs to the BNN [28]. In their study, Qiao et al proposed a combination of 3D CNN and Multi-Kernel Gaussian Process (3DMKGP) for predicting winter wheat yield in China at the county level. They used MODIS SR, LST, and annual land cover data as inputs to their network [29]. Gavahi et al suggested that they achieved more accurate and reliable spatiotemporal feature extraction by using DeepYield, which integrates Convolutional Long Short-Term Memory (ConvLSTM) with 3-dimensional CNN. They used MODIS SR, LST, and land cover data as inputs to DeepYield for predicting soybean yield in the U.S. [30].
In 2022, Li et al proposed CNN- and LSTM-based DL models to predict corn yield in agroecological regions of China at the county level. They used MODIS SR (blue, red, and near-infrared bands), temperature/precipitation data from weather stations across China, the China Meteorological Forcing Dataset (CMFD), and soil and elevation data [31]. In her Master of Professional Studies (MPS) thesis, Kuang suggested a CNN-GP integrated model for predicting corn yield in 45 counties of the U.S. state of New York at the county level. MODIS SR, LST, and CDL are used as inputs to her model [32].
In 2023, Qiao et al proposed a Knowledge-guided Spatial-Temporal Attention Graph Network (KSTAGE) for winter wheat and corn yield prediction in China and soybean yield prediction in CONUS at the county level. Their network is composed of three parts: a spectral encoder, a knowledge-guided spatial-temporal attention algorithm, and a location-aware spatial attention graph network. They used the MODIS SR product, the land cover product, and ground-truth yield data as inputs to their network [33]. Tende et al used an LSTM model to forecast district-level corn yield in Tanzania. NDVI computed from MODIS, a corn crop mask from the International Food Policy Research Institute (IFPRI) Spatial Production Allocation Model (SPAM) 2010, and Tmax, Tmean, SM, and PPT from the TerraClimate dataset are used as inputs to the proposed LSTM model [34].
Recent research results for yield prediction for a number of crop types through various DL methods are summarized in tables 1 and 2. In these tables we have focused on 4 properties of the previously suggested methods: (1) Area of interest and sampling level: It is clear that dealing with smaller areas of interest produces more accurate results. However, the number of samples would then be too limited for ML techniques. On the other hand, a bigger area of interest introduces high variation in the features, which leads to inaccurate results. This tradeoff directs researchers to handle the issue at the county level.
(2) Features: The features listed in these tables are input values fed into the DL networks. These inputs can also be considered as input variables sent to the respective prediction algorithms. It can be seen that satellite imagery, usually MODIS, is an essential feature along with climate and soil data.
(3) Prediction Algorithm: The proposed DL prediction algorithms are generally based on CNN, LSTM, or both.
(4) Performance: In order to enable comparison, commonly used evaluation metrics such as MAPE, RMSE, Mean Absolute Error (MAE), and coefficient of determination (R²), taken from the respective references, are listed. RMSE and MAE are measured in bushels per acre (bu/ac), whereas MAPE is expressed as a percentage and R² is unitless.
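For concreteness, the four metrics can be computed as in the following minimal sketch (variable and function names are illustrative, not from the cited works):

```python
import numpy as np

def evaluation_metrics(y_true, y_pred):
    """Compute RMSE, MAE, MAPE, and R^2 as used in tables 1 and 2.

    RMSE and MAE are in the units of the yield data (bu/ac);
    MAPE is a percentage and R^2 is unitless.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    mape = np.mean(np.abs(err / y_true)) * 100.0
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return {"RMSE": rmse, "MAE": mae, "MAPE": mape, "R2": r2}
```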
Two tables are used instead of one to show that crop yield prediction using DL methods is applicable to different types of crops that do not produce similar results; for example, the average MAPE for soybean is smaller than those for corn and winter wheat. Moreover, in this study, corn is considered as a case study. However, it is important from a methodological perspective to assess the DL methods used for other crops and the datasets (features) these methods utilize, in order to understand the overall progress in this regard. The fundamental reason for using two different tables is to enable a collective observation of the methods proposed for corn and, especially, the performance achieved by these models. As can be seen from the representative papers of the different groups, similar results were produced for a given crop. Small differences between the results are due to the input data used and the way these data are pre-processed. Overfitting occurs when a DL model gives accurate predictions for training data but not for new data. One of the causes of overfitting is working with small sets of ground and satellite data. Although one way to prevent overfitting is to increase the number of samples, reducing the complexity of the DL model may also be a solution when the number of samples cannot be increased.

Motivation, scope and limitation
Predicting crop yields is a complex task, traditionally relying on collecting data from fields, which can be both costly and time-consuming. Moreover, these methods are often limited to specific datasets and may not be accurate in new or changing environments. The motivation behind this study is to introduce a novel approach that leverages only satellite and yield data for crop yield prediction. This not only reduces costs and improves efficiency by eliminating the need for extensive field data but also enhances adaptability, enabling accurate predictions in diverse and changing settings. The study focuses on corn yield prediction in the U.S. Corn Belt, aiming to address the limitations of current methodologies and demonstrate the advantages of a data-efficient, adaptable, and cost-effective approach.
Most of the CNN-based yield prediction models use a 3 × 3 kernel for feature detection. Obviously, selecting a 3 × 3 kernel size minimizes the required execution time. However, the impact of various kernel sizes has not been evaluated, although it is a fact that kernel size is important in feature detection [35].
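A back-of-envelope parameter count illustrates the trade-off between one wide kernel and a stack of 3 × 3 kernels. This sketch assumes the 7-band input and 32 filters used later for SEDLA; it counts only parameters, not runtime:

```python
def conv2d_params(kernel, in_ch, filters):
    """Weights plus biases of one 2D convolution layer."""
    return kernel * kernel * in_ch * filters + filters

def receptive_field(n_layers, kernel=3):
    """Receptive field of n stacked stride-1 convolutions."""
    return n_layers * (kernel - 1) + 1

# Hypothetical setting: 7 input bands, 32 filters per layer.
single_15x15 = conv2d_params(15, 7, 32)  # one 15x15 layer: 50,432 parameters
# Seven stacked 3x3 layers are needed to match a 15x15 receptive field,
# and together they carry 57,536 parameters.
stacked_3x3 = conv2d_params(3, 7, 32) + 6 * conv2d_params(3, 32, 32)
```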
In this study, SEDLA is used for both irrigated and rain-fed corn yield prediction [36]. SEDLA uses MODIS SR, LST, and previous years' yield data as features to predict corn yield in the U.S. Corn Belt for the years 2019, 2020, and 2021. SEDLA consists of two modules to exploit both the spatial and temporal features of the data. The first module is a single-layer CNN architecture with kernel sizes spanning from 3 × 3 to 15 × 15 for convolution. The CNN module takes the previous ten years' histogram-based MODIS SR and LST data as inputs. The second module is an LSTM architecture that takes the previous ten years' yield data as input. A fully connected layer takes the CNN and LSTM outputs as inputs and produces yield predictions. This study has two main aims: (1) to show that using a simple network architecture may be more advantageous, especially in situations where the number of samples is small, such as yield prediction over large areas; and (2) to show that using a larger kernel size in a single layer, rather than the usual multi-layer 3 × 3 kernels, can give better results.
Despite the innovative approach, this study has some limitations. The primary constraint is the relatively small sample size available for training the model, particularly at the county level. This limitation is addressed by proposing a simple DL architecture. Although it is out of the scope of this study, examining additional data such as climate and soil with a large-kernel model, as well as a deeper model with a large kernel in the initial layer and small kernels in the subsequent layers, should also be considered.

Area of interest
The area of interest of this research is the U.S. Corn Belt, comprising 13 states of the USA: Illinois, Indiana, Iowa, Kansas, Kentucky, Michigan, Minnesota, Missouri, Nebraska, North Dakota, Ohio, South Dakota, and Wisconsin. Although these 13 states consist of 1175 counties in total, after the pre-processing operations explained in subsection 4.3 (Data pre-processing and tensor generation), 739 counties are selected for further operations in this research. Figure 1 shows the 739 counties used for training SEDLA.

Datasets
As this research aims to suggest a simple yet precise DL architecture, an optimized set of data has to be considered. MODIS is considered an essential remote sensing data resource. Out of the forty products of MODIS data, only two products, namely SR and LST, are used in SEDLA, since these two products are closely related to crop growth [25]. CDL is used for masking corn-planted areas. Ground truth yield data is used both as expected data (labels) in the DL model and as a feature (input) to the DL model.
MODIS Surface Reflectance: Terra Surface Reflectance 8-Day Global 500 m (MOD09A1) provides seven SR bands at a spatial resolution of 500 m for every 8 days [38]. Six of these bands (Red, Near-Infrared-1, Blue, Green, Short-Wave Infrared-1, and Short-Wave Infrared-2) were used for determining crop growth state in our research. Near-Infrared-2 is left out to reduce the input size, since its correlation with yield has been shown to be low in various papers in the literature, such as the work of Kang et al [25].
MODIS Land Surface Temperature: Terra Land Surface Temperature and Emissivity 8-Day Global 1 km (MOD11A2), which provides an 8-day average per pixel at a 1 km spatial resolution, is used for extracting the daytime LST [39]. Nighttime LST is not used, in order to reduce the input size, since its impact on plant growth is negligible except in rare extreme cases such as frost.
Yield Data: County-level corn yield data was collected from the USDA NASS Quick Stats database [40]. The data covers the period from 1999 to 2021 and was used for two purposes: as input data for the LSTM module used to predict corn yield for future years, and as labeled (expected) yield data for training the DL model.
USDA NASS Cropland Data Layers: CDL, a geo-referenced crop-specific land cover data layer created annually for the continental United States at 30 m spatial resolution, is used for masking the corn-planted area [41].
Apart from the above-mentioned datasets which are directly used for yield prediction, the United States Census Bureau TIGER dataset was used to group data in terms of counties [42].
All of the datasets, with the exception of yield data, are accessed, and the relevant computations are performed, on the GEE platform.

Data pre-processing and tensor generation
The pre-processing steps are summarized in figure 2. Data pre-processing starts with copying yield data from the USDA NASS Quick Stats database to a local SQLite database. County-level corn yield data of the 1175 counties in the area of interest for the years 1999-2021 are collected. The corn yield database contains some missing data. In order to minimize the overall error that might be introduced by regression-based imputation, counties with more than four years of missing yield data are excluded from the dataset. For counties with four or fewer years of missing yield data, regression-based imputation is used. After this first stage of elimination with regard to missing yield data, the shortlist of counties is passed to the GEE platform for further data pre-processing.
Since the data available on the GEE platform do not include county information, the United States Census Bureau TIGER dataset is used to obtain the boundaries of counties. Initially, the MODIS SR, LST, and CDL datasets are masked using the shortlisted county boundaries of the Corn Belt. These datasets are also filtered for the years 2009 to 2021, so that they are reduced in size and ready for fast processing. The MODIS SR data is cloud-masked and only the 6 most correlated bands are selected for further processing [25]. Only the daytime LST band is selected from the MODIS LST dataset, since daytime data is highly correlated with yield whereas nighttime data is only weakly correlated [25].
The USDA NASS CDL dataset is used for masking non-corn pixels in the shortlisted counties. As the focus of this research is on corn, the active days from planting to harvesting are taken as April 23rd to November 9th, considering the 8-day compositing period of the MODIS satellite. These days correspond to Day of Year (DOY) 113 to DOY 313, and this 200-day period gives 26 samples of MODIS data for a year.
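The 8-day sampling schedule can be verified with a short sketch (the helper function name is illustrative):

```python
# 26 MODIS 8-day composites per growing season, from DOY 113 (April 23)
# to DOY 313 (November 9).
sample_doys = list(range(113, 313 + 1, 8))

def step_to_doy(step):
    """Map a 1-based time step (1..26) to its day of year."""
    return 113 + (step - 1) * 8
```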
In the process of yield prediction over large areas using satellite data, the most time-consuming task is processing satellite images. Although using raw image data is more effective for discovering features in DL methods, most of the approaches in the literature use data reduction techniques such as calculating the mean [13] or subsampling [12,17]. In this research, instead of averaging or subsampling, a histogram-based data reduction technique is preferred in order to preserve the entire pixel and/or spatial data distribution. Moreover, a histogram-based data reduction technique significantly reduces the size of the data while maintaining its features [12]. A number of preliminary tests, as well as values taken from the literature [21,26], indicate that a 32-bin solution is sufficient for yield prediction. The histograms are computed per county on the GEE platform using 32 bins per collection.
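A minimal sketch of the histogram-based reduction, assuming pre-extracted per-county pixel values (the function name and the synthetic data are illustrative, not the GEE implementation):

```python
import numpy as np

def county_histogram(band_pixels, band_min, band_max, n_bins=32):
    """Reduce one band's corn-masked pixel values for one 8-day composite
    to a fixed-size histogram, preserving the value distribution."""
    edges = np.linspace(band_min, band_max, n_bins + 1)
    counts, _ = np.histogram(band_pixels, bins=edges)
    return counts  # shape (n_bins,)

# Synthetic reflectance values standing in for one county's corn pixels.
rng = np.random.default_rng(0)
pixels = rng.uniform(0.0, 0.5, size=5000)
hist = county_histogram(pixels, 0.0, 0.5)
```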
For each county, 32-bin histograms are downloaded from the GEE platform to the locally available SQLite database for further pre-processing operations. In the local database, each county has 26 samples across seven bands for each year. If a county's histogram dataset contains missing data, regression-based data imputation is performed, provided that sufficient data exist for imputation. In case imputation is not possible, the county is excluded from the shortlist.
Even after arriving at such a shortlist through the several filtering processes explained above and shown in figure 2, we observed that some counties lack a significant amount of corn-planted area. Preliminary evaluations showed that more than 10,000 acres of a county must be corn-planted in order to contribute positively to the computation of yield prediction. In practice, any county whose corn-planted area is less than 12,355 acres (corresponding to 200 pixels of a satellite image) is excluded.
During the pre-processing stage, the number of counties was reduced from 1175 to 739 in two stages. In the first stage, counties having more than 4 years of missing yearly yield data were removed from the list; 304 such counties were excluded from the shortlist. In the second stage, counties with missing parts in the satellite data whose total planted area stayed below 200 pixels, as well as counties with very little corn-planted area, were excluded from the shortlist. The corn-planted area (i.e., the total pixel count in a histogram) is an important factor for the deep learning network to learn. The proposed deep learning network captures the trends of the 32-bin histogram data across different time steps. It was observed that the 32-bin histogram data did not create a trend in counties with a total corn-planted area of less than 200 pixels; these counties were excluded because their low pixel count per bin would not generate a meaningful trend for the network to learn. 132 such counties were excluded from the shortlist. Finally, 739 out of 1175 counties are shortlisted for DL computation. As the corn-planted area of the 739 counties varies over a large span, a final data pre-processing step is performed to normalize the data. Instead of classical normalization methods such as min-max normalization, Z-score normalization, and decimal scaling normalization [43], the ratio of the pixels in a bin to the total number of pixels per county is used as the normalization operation.
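The normalization step can be sketched as follows (a toy example under the description above, not the exact implementation):

```python
import numpy as np

def normalize_histogram(counts):
    """Express each bin as the fraction of the county's corn pixels it holds,
    making counties with very different planted areas comparable."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

# Toy 4-bin example: two counties with different areas, same distribution.
small = normalize_histogram([10, 20, 50, 20])
large = normalize_histogram([100, 200, 500, 200])
# Both normalize to the same fractions [0.1, 0.2, 0.5, 0.2].
```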
The DL model introduced in this research demands tensors as input. These tensors were generated through a transformation based on 32-bin histograms of image composites, incorporating six SR bands and one LST band. Initial observations on the data indicate that the theoretical minimum and maximum values cannot be reached, since the data is taken for only 200 days. In order to capture the distribution of the data accurately, the observed minimum and maximum values for each band are recorded, and bin limits and sizes are computed with respect to these values, as shown in table 3. The range between the new min and new max values is divided into 32 equal bins for each feature separately.
26 time steps are taken per year per county in terms of 32 bins, as explained above. Since there are 7 features to feed in, the tensor becomes 3D data of size 26 × 32 × 7, where the first dimension represents the time step, the second the number of bins, and the third the number of bands. This tensor is prepared for the CNN module in our architecture (figure 3). Note that y_p indicates the year to be predicted, and y_i is used as an index to point to the previous years' data in figure 3. Another tensor of size 10 × 1 is prepared for the LSTM module in our architecture from the ground truth yield values provided by USDA NASS statistics (figure 4). Similarly, in figure 4, y_i−1 to y_i−10 indicate previous years' yield data.
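The tensor shapes can be summarized with a short sketch (constants taken from the text; variable names are illustrative):

```python
import numpy as np

TIME_STEPS, BINS, BANDS = 26, 32, 7   # per-year satellite histogram tensor
HISTORY = 10                          # years of yield history per sample
N_SAMPLES = 739 * 10                  # 739 counties x 10 sampled years

# One county-year sample: a satellite tensor and a yield-history tensor.
satellite_sample = np.zeros((TIME_STEPS, BINS, BANDS), dtype=np.float32)
yield_sample = np.zeros((HISTORY, 1), dtype=np.float32)

# Full training tensors would stack N_SAMPLES such samples:
satellite_shape = (N_SAMPLES, TIME_STEPS, BINS, BANDS)  # (7390, 26, 32, 7)
yield_shape = (N_SAMPLES, HISTORY, 1)                   # (7390, 10, 1)
```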
In order to predict the yield of a year, SEDLA utilizes the previous ten years of satellite data, fed into the CNN, and, for each of those years, the preceding ten years of yield data, fed into the LSTM. For example, in order to predict the yield for the year 2020, satellite data for 2010-2019 is fed into the CNN, and for each year between 2010 and 2019, the previous ten years' yield data (e.g., the 2000-2009 data for the year 2010) is fed into the LSTM.
10 years of sampled data for 739 counties results in a total of 7390 samples, each of which comprises a tensor of satellite data and a tensor of yield data. The tensor of satellite data is of size 7390 × 26 × 32 × 7, meaning it has 7390 county-years, 26 rows (time intervals), 32 columns (bins), and 7 bands. The tensor of yield data is of size 7390 × 10 × 1, meaning it has 7390 county-years, 10 time steps (for the previous 10 years), and 1 band (yield data). Therefore, tensors for training the CNN module and the LSTM module are produced.
SEDLA (figure 5) consists of two essential DL modules, a 2D CNN module and an LSTM module, followed by two layers: a concatenation layer and a fully connected single layer. The first module, i.e., the 2D CNN, is used to extract the spatial and temporal features of the time-series data. In this module, convolution layers with kernel sizes ranging from 3 × 3 to 15 × 15 and depths ranging from one to five layers are used. Although batch normalization and dropout usually follow CNN layers in the literature, preliminary examinations showed that the improvement provided by these layers is negligible; thus, batch normalization and dropout are not used, reducing the cost of operation. Following the convolution layer (or layers), the output is flattened and sent into a fully connected (FC) layer (dense layer). It is also known that there is a strong relationship between previous years' yield data and the predicted yield. For instance, Kang et al indicated that the yield of the preceding five years exhibited a correlation of 0.67 with the present yield [25]. In our model, we adopted an even richer scheme to include previous yield data. For each 2D CNN input (a sample for the DL network), the corresponding previous ten years' yield data is used as input to the LSTM module, since LSTM networks possess the ability to learn the relationships between past and future values. The LSTM module consists of an LSTM layer and an FC layer; the output of the LSTM layer is sent to the FC layer.
Both the CNN and LSTM modules produce 8 values per county, per year. The outputs of the CNN and LSTM modules are concatenated to gather both the satellite-based data and the previous years' ground truth yield data into a single-dimension set of data, which is eventually fed into another FC layer to produce the predicted yield. Note that the training session for SEDLA requires ground truth corn yield data (labels) as input, whereas no such input is required at the prediction stage. In SEDLA, there are 8 essential hyperparameters and DL network properties being considered:
• The number of layers in the multilayer Conv2D, which is implemented for one to five layers.
• The filter size is set to 32 for Conv2D layers.
• Rectified Linear Unit (ReLU) is used as the activation function for Conv2D and FC layers, except for the FC layer that outputs the predicted yield, for which linear activation is used.
• 128 LSTM units are used for the LSTM module.
• The learning rate is dynamically decreased using the ReduceLROnPlateau callback from 0.001 to 0.00001.
• Adam optimizer is used for training and early stopping is also used to prevent overfitting.
• The mean square error (MSE) is used as the loss metric.
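Under the hyperparameters listed above, a minimal Keras sketch of the architecture might look as follows. The exact layer ordering and the use of "same" padding are assumptions; this is a sketch, not the authors' implementation:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_sedla(kernel_size=15, n_conv_layers=1):
    # CNN branch: 26 time steps x 32 bins x 7 bands of histogram data.
    cnn_in = keras.Input(shape=(26, 32, 7))
    x = cnn_in
    for _ in range(n_conv_layers):
        x = layers.Conv2D(32, kernel_size, padding="same",
                          activation="relu")(x)
    x = layers.Flatten()(x)
    x = layers.Dense(8, activation="relu")(x)

    # LSTM branch: previous ten years' county yield values.
    lstm_in = keras.Input(shape=(10, 1))
    y = layers.LSTM(128)(lstm_in)
    y = layers.Dense(8, activation="relu")(y)

    # Concatenate the two 8-value outputs and regress the predicted yield.
    out = layers.Dense(1, activation="linear")(layers.Concatenate()([x, y]))
    model = keras.Model(inputs=[cnn_in, lstm_in], outputs=out)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),
                  loss="mse")
    return model
```

Calling `build_sedla(kernel_size=3, n_conv_layers=5)` would give the more conventional deep 3 × 3 variant that the paper compares against.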

Model training
In order to predict the yield of a year (y_p), satellite data for the previous ten years (y_i) and, for each particular year, ten years of yield data (y_i−10, y_i−9, …, y_i−1) are utilized (figure 6). These datasets were randomly split into two categories: 80% for training and 20% for testing, with this division performed separately for each year. This approach guarantees an unbiased representation across years, as training and testing data are evenly distributed and balanced.
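The per-year 80/20 split can be sketched as follows (the function name and seed are illustrative):

```python
import numpy as np

def per_year_split(years, test_frac=0.2, seed=0):
    """Split sample indices 80/20 separately within each year,
    so every year is equally represented in train and test sets."""
    years = np.asarray(years)
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for y in np.unique(years):
        idx = np.where(years == y)[0]
        rng.shuffle(idx)
        n_test = int(round(len(idx) * test_frac))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.array(train_idx), np.array(test_idx)
```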
For each county and year, there are 26 time steps for sampling, from DOY 113 to DOY 313. In order to investigate the performance of yield prediction, we focused on the second half of the lifetime of the crop. Therefore, predictions start with time step 14, which corresponds to DOY 217, and end with time step 26, which corresponds to DOY 313. In order to reduce the computation time, every other time step starting with time step 14 is taken into consideration, which results in computation for 7 time steps, ranging from August 5th to November 9th.
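The mapping between time steps, DOY, and calendar dates above follows directly from the 8-day compositing interval; a small sketch (the function name is ours) reproduces it.

```python
from datetime import date, timedelta

FIRST_DOY, INTERVAL = 113, 8        # 26 composites: DOY 113, 121, ..., 313

def doy_of_step(step):
    """DOY of a 1-indexed time step (steps 1..26)."""
    return FIRST_DOY + INTERVAL * (step - 1)

# Every other step from 14 to 26 -> the 7 prediction dates
steps = list(range(14, 27, 2))
doys = [doy_of_step(s) for s in steps]
# Convert DOY to a calendar date in a non-leap year
dates = [date(2018, 1, 1) + timedelta(days=d - 1) for d in doys]
```

Step 14 gives DOY 113 + 8 × 13 = 217 (August 5th) and step 26 gives DOY 313 (November 9th), matching the range stated above.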
During the course of training, the number of epochs was set to 100 with a batch size of 16. In DL approaches, consecutive epochs are usually expected to converge by producing lower loss rates. However, it can be observed that the loss rate does not improve over a number of consecutive epochs in training sessions. This means that the convergence value has been reached and further epochs would not improve the loss rate. In SEDLA, an early stopping technique is used to terminate training when 5 consecutive epochs do not improve the loss rate. The mean square error (MSE) was selected as the performance measure to monitor for early stopping.

Experiments
The experiments based on SEDLA require additional information in order to describe the internals of the performance and model evaluation processes. The essentials of the evaluation of the proposed model are explained in section 5.1. In section 5.2, the performance of the proposed method with different kernel sizes at different time steps is compared. In section 5.3, the results of the experiments performed for the proposed DL architecture implemented with different numbers of layers, i.e., 1 to 5, and different kernel sizes, i.e., 3 × 3 to 15 × 15, are given along with a detailed comparison of the results.

Model evaluation
The essentials of the evaluation of the proposed model are explained in this subsection in order to describe the internals of the model evaluation process. The corn yields of the years 2019, 2020, and 2021 are predicted for the performance evaluation of SEDLA. When making predictions for any given year, only data from the previous 10 years is utilized, and the network is trained separately each time. For example, when predicting for the year 2019, the network was trained using data from the years 2009-2018. Even if the same input data is used to train a deep neural network multiple times, the network may produce slightly different results on each run. Accordingly, in our experiments, to evaluate model performance, 5 independent executions are performed for each predicted year and the average of the 15 executions for the years 2019, 2020, and 2021 is computed. Since the values are close to each other, i.e., the variance is small, taking the average is an acceptable approach.
MSE, RMSE, MAPE, and R 2 are selected as evaluation measures. The equations used for computing MSE, RMSE, MAPE, and R 2 are presented in (1)-(4), where y t is the predicted value, ŷ t is the observed value, ȳ t is the average of the observed values, and n is the number of samples [44].
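Since equations (1)-(4) are the standard definitions of these metrics, they can be sketched directly (function names are ours; the formulas are the textbook ones):

```python
import math

def mse(y_true, y_pred):
    """Mean square error over n samples."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean square error: square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent of the observed values."""
    return 100 / len(y_true) * sum(abs((t - p) / t)
                                   for t, p in zip(y_true, y_pred))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot
```

For example, observed yields [100, 200] and predictions [110, 190] give MSE 100, RMSE 10, MAPE 7.5%, and R 2 of 0.96.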
We observed that the lack of a sufficient number of samples is one of the essential problems in yield prediction at county level. Even when the area of interest is extended to the Corn Belt, the number of samples is still limited. The relatively small sample size adversely affects machine learning. One way of dealing with this problem is to use a simple DL architecture.
The input data for the 2D CNN is a tensor consisting of histogram-based data in one dimension and time in the other. This two-dimensional data structure of the 4D tensor constitutes complex features. In order to make the proposed simple DL architecture learn these features, a number of experiments were performed to observe the impact of varying kernel sizes from 3 × 3 to 15 × 15. It is observed that larger kernels give more successful results than the small kernel sizes, such as 3 × 3 or 5 × 5, usually used in the literature. Similarly, small-sized kernels do not perform well even in multilayered architectures.

Comparison of kernel sizes for single-layer CNN
It is already mentioned that the kernel size is an important hyperparameter for feature detection in DL.In this set of experiments, the impact of kernel size is observed in a single convolutional layer implementation of SEDLA for different time steps.
Each prediction run is implemented as a follow-up process of a training run. Thus, 5 training and prediction runs are performed per year, and these runs are repeated for the years 2019, 2020, and 2021. Each of these prediction runs is performed for 7 different kernel sizes ranging from 3 × 3 to 15 × 15 and for 7 time steps between Aug 5 and Nov 9, one every 16 days. Therefore, each mark in figure 7 represents the average of 15 runs, which gives a total of 5 × 3 × 7 × 7 = 735 training and prediction runs. Although two modules run concurrently for yield prediction, the LSTM module is implemented once and left unchanged throughout all tests. However, the kernel size of the 2D-CNN module is changed to see its impact on prediction performance. As can be seen in figure 7, larger kernel sizes (9 × 9, 11 × 11, 13 × 13, and 15 × 15) perform better compared to smaller ones. This indicates that it is easier to learn the complex features of the histogram-time tensor with larger kernel sizes, as bigger portions of the input data are considered when learning the relationships between the different features.
In order to compare the proposed approach to those available in the literature, three metrics are computed: RMSE, MAPE, and R 2 , as these performance metrics are commonly used in the literature. Figure 7(a) presents the RMSE of the proposed approach in bu/acre, figure 7(b) the MAPE as a percentage, and figure 7(c) the R 2 . The performance of yield prediction is expected to improve as the prediction day gets closer to the harvest. Such a trend can be seen in figures 7(a), (b), and (c), with an exceptional trend change on Nov 09, since some of the MODIS data would include information from after harvesting.
In all three figures, the experimental values for the 3 × 3, 5 × 5, and 7 × 7 kernel sizes can be grouped distinctly because of their poor performance. It can be seen that kernel size 15 × 15 often performs better than the others, and there is a trend that the larger the kernel, the better the performance.

Comparison of kernel sizes for multilayer CNN
The impact of kernel size in a multilayer CNN implementation is another issue in our research. In this set of experiments, the performance of kernel sizes varying from 3 × 3 to 15 × 15 is measured. Moreover, the performance of a multilayer CNN with a smaller kernel size is compared against the performance of a single-layer CNN with a larger kernel size.
Although it is possible to run these tests over 3 variables by including layer count in the previous set of tests, the testing time would not be feasible, since an increase in the number of layers results in an exponential increase in execution time. Also, a number of preliminary tests with 3 variables on some randomly selected configurations produced similar results. Therefore, it was decided to perform these tests for a single date. In the single-layer CNN tests, kernel sizes 5 × 5, 9 × 9, 11 × 11, and 15 × 15 produced their best performance on Oct 24, kernel sizes 7 × 7 and 13 × 13 produced their best performance on Oct 8, and kernel size 3 × 3 produced its best performance on Nov 09. Among these 3 dates, Oct 24 was chosen as the preset date so that the multilayer architecture could be tested over 2 variables, i.e., number of layers and kernel size.
It is clear that small-sized kernels are computationally efficient, since they can be processed quickly owing to their size. Additionally, using multiple small-sized kernels can allow for more flexibility in the design of multilayer neural networks. However, there are also some disadvantages to using multiple small-sized kernels. An essential disadvantage is that they can require more memory, because all of the small-sized kernels belonging to different layers need to be stored in memory. Another disadvantage is that the area of the input image used to calculate the output of a neuron, i.e., the receptive field of the neural network, is reduced. Even the multilayer use of small-sized kernels still yields a smaller receptive field than a single large kernel, which makes it harder for the neural network to learn complex patterns in the input data.
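The receptive-field argument can be made concrete with the standard formula for stacked stride-1 convolutions (the function name is ours; the formula is the usual one):

```python
def receptive_field(kernel_sizes):
    """Receptive field of a stack of stride-1, dilation-1 convolutions:
    RF = 1 + sum over layers of (k - 1)."""
    return 1 + sum(k - 1 for k in kernel_sizes)

single_large = receptive_field([15])     # one 15 x 15 layer  -> RF 15
five_small = receptive_field([3] * 5)    # five 3 x 3 layers  -> RF 11
```

Even five stacked 3 × 3 layers cover a receptive field of only 11, still smaller than the 15 of a single 15 × 15 kernel, which is consistent with the observation that small kernels struggle on the histogram-time tensor even in multilayer form.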
Deciding whether to use multiple small-sized kernels or a single large-sized kernel is application-dependent [45]. The tensor used as input to this particular application has some special properties, and therefore the performance of this particular DL architecture needs to be tested. From the perspective of computational complexity, small-sized kernels are known to provide faster solutions. However, a single large-sized kernel is a better option for learning complex patterns.
Similar to what is done in section 5.2, the test results are evaluated through 3 metrics: RMSE, MAPE, and R 2 . As can be seen in figures 8(a)-(c), increasing the number of layers usually yields worse results regardless of kernel size. Contrary to the single-layer CNN implementation, smaller-sized kernels such as 3 × 3, 5 × 5, and 7 × 7 outperform larger-sized kernels in terms of performance. Small-sized kernels produce almost stable results, whereas larger-sized kernels produce results with higher error margins as the number of layers increases. However, it is important to note that all of the multilayer CNN-based results are worse than their respective single-layer CNN-based results. Therefore, one can claim that a single-layer CNN performs better than a multilayer CNN for this particular class of application.

Results
This study investigated the influence of kernel size and temporal variations on the performance of a crop yield prediction model. Larger kernel sizes led to consistently lower RMSE and MAPE values, indicating improved prediction accuracy. This suggests that capturing broader spatiotemporal patterns in the data enhanced the model's ability to minimize overall error. R 2 values also generally increased with larger kernels, implying a better fit of the model to the data. These findings confirm the positive impact of larger kernels on model performance.
While R 2 showed minimal temporal fluctuations, both RMSE and MAPE exhibited variations across dates. These changes could be attributed to the randomly split training and validation data, since the DL models used for yield prediction are data-driven. Significantly, the error bars for RMSE and MAPE tended to shrink at later dates, indicating a decrease in model uncertainty over time. As harvesting time approaches, the model is expected to yield better results, since this temporal proximity to the harvest allows the model to better capture and adapt to the dynamic factors influencing the crop's growth and maturation.
To evaluate the performance of SEDLA, five prominent machine learning methods were considered as baselines for comparison, as shown in figure 12. These methods are a Deep Neural Network (DNN) with three hidden layers and 256 neurons, Ridge Regression (RR), Random Forest (RF), the Least Absolute Shrinkage and Selection Operator (LASSO), and Multilevel Deep Learning (MLDL). The RMSE, MAPE, and R 2 values were obtained from the work of Sun et al (2020) using the WebPlotDigitizer software. A rigorous assessment was conducted using the essential metrics RMSE, MAPE, and R 2 . The analysis covered significant dates, providing a comprehensive understanding of the models' predictive capabilities. RMSE, a critical metric for assessing prediction accuracy, reveals the deviation between predicted and actual values; lower RMSE values indicate superior model performance. Examining the RMSE results for each model on distinct evaluation dates, SEDLA and MLDL demonstrated the lowest RMSE on all dates, but SEDLA outperforms MLDL starting from 6 September. MAPE provides insight into the magnitude of errors as a percentage of actual values; lower MAPE values indicate higher accuracy. When examining the MAPE results for each model on specific evaluation dates, SEDLA outperforms all the other models. R 2 quantifies the proportion of the variance in the dependent variable that is predictable from the independent variables; higher R 2 values signify a better model fit. When examining the R 2 values, it can be observed that MLDL performs better than SEDLA. However, starting from September 6th, SEDLA's values start to increase while MLDL's values decrease, and the values for both models begin to converge. This may be due to differences in the pre-processing stage, specifically the counties extracted during pre-processing. It is concluded that the preprocessing of the data in this context can also have an impact on the results obtained.
In summary, SEDLA consistently demonstrated superior performance across RMSE and MAPE evaluation metrics on all dates, establishing itself as a highly accurate and reliable predictive model in this analysis.

Discussion
Changes in region, climate, crop, or practice are expected to influence the absolute predicted yield values to some extent.However, due to its reliance on satellite and yield data only, our model should capture similar trends in yield patterns across diverse contexts.This makes it potentially generalizable to different regions and climates, even with limited data, offering a scalable and affordable approach for yield prediction.
When SEDLA is compared with other machine learning methods such as MLDL, DNN, RR, RF, and LASSO, it consistently demonstrates superior performance on the RMSE and MAPE evaluation metrics on all dates, establishing itself as a highly accurate and reliable predictive model in this analysis.
SEDLA is also compared to the DL methods used for corn yield prediction that are given in table 2. Yield prediction depends on many different parameters, such as crop genotype, soil, climate, weather, and management practices, most of which change from parcel to parcel [5]. Since more accurate information is available at the field or parcel level, the yield prediction error is lower at the field and parcel levels compared to the county or state level. Therefore, studies conducted in a single state are expected to have lower errors than studies conducted in multiple states. When the performances of the DL-based methods proposed for corn yield prediction are examined (table 2), it is observed that the performance is better for studies conducted in areas where geographical and soil characteristics are more homogeneous and which are relatively smaller than the Corn Belt, such as a single state. Kuwata and Shibasaki used MODIS EVI, CRU Tmax, Tmin, potential evaporation, and pressure data for predicting corn yield in the U.S. state of Illinois, and they obtained an RMSE of 6.30 [4]. Kim and Lee proposed a DL method for predicting corn yield in the U.S. state of Iowa at county level using MODIS NDVI, EVI, LAI, FPAR, GPP, ET, soil moisture, and climate data as inputs; they obtained a MAPE of 6.50 [16]. Kuang used CNN and GP methods together for corn yield prediction in the U.S. state of New York at county level. She used MODIS SR and LST as inputs and obtained a MAPE of 2.98 [32]. Kuwata and Shibasaki [4], Kim and Lee [16], and Kuang [32] proposed corn yield prediction models whose performances are evaluated for a single state. The performances of the methods proposed by Kuwata and Shibasaki [4] and Kim and Lee [16] are remarkable, given that the evaluations are done within the context of a single state. On the other hand, Kuang's proposal performs exceptionally well, which may not be explained solely by its focus on a single state. Although Kuang used the same method proposed by You et al (2017), the obtained performance is too good when compared with other studies in the literature. Her study requires further investigation into issues such as the predicted year, the dataset selection for training and validation, whether the reported results are the best or the average, and whether the results are specific to the studied state.
Tende et al used an LSTM architecture for predicting corn yield in Tanzania at district level. They used MODIS NDVI and climate data as inputs to their network and obtained a MAPE of 3.66 [34]. Although this study shows a better performance compared with others, it is seen that the monthly averages of the lowest and highest temperature values at district level do not change much across the evaluation days of the simulation. Moreover, other geographical and meteorological values are also largely unchanging because of Tanzania's proximity to the Equator. This provides a unique set of features for this particular research.
Because of the data dependency in yield prediction and DL architectures, in order to make a consistent evaluation of SEDLA, it is appropriate to compare it with similar models that are evaluated in the same geographical region and at the same scale. In this context, when compared to the other studies conducted in the U.S. Corn Belt at county level, SEDLA achieves better performance.

Conclusion and future work
Utilizing remote sensing data is a valuable asset in forecasting crop yields, as it enables the measurement of diverse factors influencing crop growth, including crop development phenology, land cover, topography, and weather conditions. Additionally, remote sensing allows for extensive coverage across large areas over time, and it is relatively cost-effective. Both temporal and spatial considerations play crucial roles in the prediction of crop yields. Existing approaches to crop yield prediction often fail to capture temporal patterns, or they propose complex, layered DL architectures with small kernel sizes, and small kernels fail to learn complex features.
Our study indicates that a wider model is more effective in crop prediction. Increasing the width of a layer enhances the ability to learn complex patterns in the data; however, making multiple layers wide increases the computational cost and brings drawbacks such as overfitting. Therefore, a single-layer network with a large kernel yields better results. Nevertheless, considering the increased computational cost and potential drawbacks like overfitting, it may be valuable to consider small-kernel layers following a large-kernel initial layer to capture relationships among complex features without being affected by these drawbacks. Regarding crop types, it is suggested that the proposed network structure for corn would also work for similar crops such as barley. For different crops such as rice and sugar cane, it is necessary to retrain the network based on their specific characteristics; after retraining, the model can make adequate predictions for these crops as well.
The predictions made by deep learning models proposed for yield forecasting are entirely dependent on the data given to the model during training.If the model is trained with sufficient heat and drought data, it has the potential to successfully forecast these special conditions.However, the use of satellite and yield data, especially since the 2000s, has led to a very small number of years affected by heat and drought compared to other years, and therefore, very good predictions cannot be made for these special conditions for now.The system needs to see more heat and drought data to make successful predictions.
The proposed deep learning architecture for corn yield prediction consists of two modules: a 2D CNN and an LSTM. The CNN with large kernel sizes (i.e., 15 × 15, 13 × 13, and 11 × 11) extracts spatial and temporal features from MODIS SR and LST data, while the LSTM learns the relationships between past and future corn yield values. The outputs of the two modules are concatenated and fed into a fully connected layer to output the predicted yield. The proposed architecture, SEDLA, is evaluated for predicting corn yield in the U.S. Corn Belt states from 2019 to 2021 at county level. The results show that SEDLA is able to capture temporal patterns and learn complex features that are important for crop yield prediction.
Although it is advantageous to use many hidden layers in deep learning, a single-layer structure is proposed in order to keep the architecture simpler. In addition, the use of a large kernel size is important for capturing the spatial and temporal changes in the SR and LST data.
It is worth comparing other DL studies for predicting corn yield. As shown in table 2, although yield predictions at a small scale or in a small area (such as a single state) give better results, corn yield predictions across multiple states at county level give similar results to each other.
It has been observed that a simple architecture and a large kernel size perform better, especially when the sample size is small. As the next step of our research, we are investigating the impact of using varying kernel sizes in a single DL implementation. In that perspective, the architecture starts with bigger kernel sizes, such as 15 × 15, 13 × 13, or 11 × 11, to learn the complex features and then continues with smaller kernel sizes to provide faster yet more efficient computation. A second direction for future work is to investigate how simple architectures and large kernel sizes perform when additional data, such as soil, weather, and climate data, are added to the satellite data. The stability achieved with small-sized kernels in CNN-based models does not translate into improved performance, since the features considered are complex. The deep learning network must learn relations about how these features' trends change in time; these relations are complex, and small-sized kernels are not sufficient for learning them. Since most of the proposed DL methods use satellite data, which has such complex relations in time, it is worth discussing the effect of using large-sized kernels. In this context, testing the use of large-sized kernels on other agricultural products, and exploring the application of larger kernel sizes in the initial layers of other CNN-based methods that employ a more complex structure and incorporate climate, soil, and similar data alongside satellite data, are suggested as further studies that would provide more in-depth insights in this regard.

Figure 3 .
Figure 3. CNN module and input tensor for training (y p is the predicted year).

Figure 4 .
Figure 4. LSTM module and input tensor for training.

Figure 5 .
Figure 5. Proposed deep learning training architecture overview.

Figure 6 .
Figure 6. Proposed deep learning prediction architecture overview.

Figures 9-11 analyse RMSE, MAPE, and R 2 , often with error bars or standard deviations, providing key insights.

Figure 9 .
Figure 9. A comparative analysis of RMSE with (a) max-min error bars (b) standard deviation error bars.

Figure 10 .
Figure 10. A comparative analysis of MAPE with (a) max-min error bars (b) standard deviation error bars.

Figure 11 .
Figure 11. A comparative analysis of R 2 with (a) max-min error bars (b) standard deviation error bars.

Figure 12 .
Figure 12. Comparative performance analysis of machine learning methods with SEDLA using (a) RMSE (b) MAPE (c) R 2 .
Schwalbert et al used the LSTM approach for predicting soybean yield in Brazil. NDVI, EVI, daytime LST from MODIS, and precipitation from the Climate Hazards Group InfraRed Precipitation with Stations (CHIRPS) dataset are used as inputs to the LSTM [24]. Kang et al compared six ML methods, including LASSO, Support Vector Regressor, RF, XGBoost, LSTM, and CNN, for county-level maize yield prediction in the U.S. Midwest. They used MODIS SR, EVI, GCI, NDWI, LST, ESA CCI SM, PRISM air temperature, PPT, Vapor Pressure Deficit (VPD), Standardized Precipitation Index (SPI), Extreme Degree Days (EDD), ALEXI model ET, land surface model results, soil data, and crop progress reports as inputs to these methods. They used three different sets of environmental variables (Minimum set, Good set, and Full set) and computed the correlation dynamics between the environmental variables and yield [25]. Sun et al proposed a Multilevel Deep Learning (MLDL) model coupling CNN and LSTM for predicting corn yield in the U.S. Corn Belt at county level. Their study used MODIS SR and LST data together with Daymet weather data as time-series data; soil property data and the United States Department of Agriculture (USDA) National Agricultural Statistics Service (NASS) Cropland Data Layer are also used as inputs.

Table 2 .
Corn Yield Prediction Algorithms and Their Performances.

Table 3 .
Data ranges and bin size of the features.