Pansharpening and spatiotemporal image fusion method for remote sensing

In recent decades, remote sensing technology has progressed rapidly, leading to the development of numerous earth satellites such as Landsat 7, QuickBird, SPOT, Sentinel-2, and IKONOS. These satellites provide multispectral images with a lower spatial resolution and panchromatic images with a higher spatial resolution. However, satellite sensors are unable to capture images with high spatial and spectral resolutions simultaneously due to storage and bandwidth constraints, among other factors. Image fusion in remote sensing has emerged as a powerful tool for improving image quality and integrating important features from multiple source images into one, all while maintaining the integrity of critical features. It is especially useful for high-resolution remote sensing applications that need to integrate features from multiple sources and is hence a vital pre-processing step for various applications, including medical imaging, computer vision, and satellite imaging. This review initially gives a basic framework for image fusion, followed by statistical analysis and a comprehensive review of various state-of-the-art image fusion methods, which are classified based on the number of sensors used, processing levels, and type of information being fused. Subsequently, a thorough analysis of STF and pansharpening techniques for remote sensing applications is presented, where a DEIMOS-2 satellite dataset is employed for evaluating various pansharpening methods while MODIS and Landsat images are employed for the spatiotemporal fusion methods. A comparative evaluation of several approaches has been carried out to assess the merits and drawbacks of the current approaches. Several real-time applications of remote sensing image fusion have been explored, and current and future directions in fusion research for remote sensing have been discussed, along with the obstacles they present.


Introduction
Recently, image fusion has become a relatively new field with the overarching goal of producing a unified image by merging the relevant information from multiple images acquired by diverse sensors. Nevertheless, for image fusion to be effective, it must be capable of extracting every vital detail from the input images while preventing the output image from containing discrepancies or inconsistencies. Furthermore, it is more appropriate for the visual processing and analysis requirements of humans, robots, and other image-related undertakings [1]. Satellite imaging, surveillance, remote sensing, photography, recognition, and even medical imaging all rely heavily on image fusion. During the fusion process, significant information from the reference image must be kept intact; artifacts and inconsistencies must be avoided at all costs; and noise and superfluous features must be minimized. Hence, the fused image reduces space and cost and provides wider coverage by suppressing irrelevant features and noise [2,3]. With the increasing demand for high-performance, practical, and cost-effective image fusion methods in real-world applications, academics have boosted their efforts to develop more effective fusion methods. There are three primary reasons for the rising need for high-performance image fusion techniques: a) a rise in the number and variety of complementary images acquired for various applications; b) technological developments in signal processing; and c) the need for image fusion methods with high throughput and low cost. To facilitate remote sensing applications, an increasing number of satellites are acquiring images of the observed environment at different temporal, spatial, and spectral resolutions, and these are further used in various applications [4]. The objective of image fusion methods is to optimize the exploitation of visually identical images, leading to the development of numerous approaches to image fusion over time [5]. There are several technical constraints, such as cost, power consumption, and environmental factors, that could slow down the creation of higher-quality or more specialized sensors. As a solution, image fusion has emerged as an effective approach for combining images obtained from different sensors or camera configurations, offering cost-effective and efficient image fusion methods. The introduction of many effective signal processing techniques in recent times has facilitated significant improvements in the performance of image fusion [4].

Contribution of the survey
Image fusion is a popular area of research in remote sensing image processing, with the goal of enhancing spatial and temporal resolution, change detection, reliability, display capabilities, and system performance robustness. The fundamental issue in image fusion entails the identification of the optimal approach for amalgamating several input images. Numerous image fusion methodologies have been devised to effectively combine a panchromatic (PAN) image and a multi-spectral (MS) image, resulting in an MS image that exhibits enhanced spatial and spectral resolution simultaneously. While previous research has focused mostly on the basic insights of image fusion, including classification on the basis of the number of sensors used, processing levels, and type of information being fused [3,4,6-8], this review article covers every aspect, including the prevailing state of pansharpening and spatiotemporal fusion methods for remote sensing, their real-world applications, knowledge gaps, and potential future directions. Although numerous authors have examined either pansharpening or spatiotemporal fusion in the domain of remote sensing, they have not conducted an exhaustive examination of both [5,9,10]. The primary concepts explored in the article can be concisely outlined as follows:
• This article gives an insight into the image fusion process, its levels, and the various technologies of image fusion.
• The article presents a comprehensive survey of the existing pan-sharpening and spatiotemporal image fusion (STF) techniques. It also addresses the research gap in the remote sensing arena.
• Several pansharpening techniques are compared in the article, along with their benefits and limitations.
• The article also compares and contrasts a number of the existing STF methods, highlighting their benefits and limitations.
• The article presents some of the application areas, unresolved issues, and future trends.
The subsequent sections of the article are arranged in the following order: section 2 briefly gives an insight into the image fusion process, highlighting the innovative fusion techniques that are currently being used at the pixel, feature, and decision levels. Additionally, it discusses distinct fusion strategies such as multi-view, multi-modal, multi-temporal, and multi-focus fusion. Section 3 provides a comprehensive examination of fusion techniques within the field of remote sensing, classifying them into two primary categories: pansharpening and STF. Some of the application areas of remote sensing fusion are discussed in section 4. Some open issues are identified, and future trends are discussed in section 5, followed by a conclusion in section 6.

Image fusion
The creation of a single, unified image is the fundamental goal of image fusion, also referred to as IF for short. This is achieved by the integration of data obtained from a number of distinct images. The primary stages of the IF process are decomposed according to figure 1. During the stage of image acquisition, a single sensor or numerous sensors with distinct modalities are used to acquire images that are visually distinct from one another or that complement one another [3]. The pre-processing phase removes or greatly reduces any noise or distortions that were introduced into the unprocessed images during the acquisition stage. Image analysis can involve a wide variety of procedures, including image registration and image fusion [11], with the end goal being the extraction of valuable underlying information contained in the processed images while simultaneously minimizing cost. Image registration is a challenging task in the field of optimization, which seeks to minimize expenses while effectively using the common features shared by several images. Image registration is the process of aligning the corresponding aspects of a number of different images with regard to a benchmark or constant image. This technique is used to register images from a large number of different sources, with the original image serving as a reference at some point. Subsequently, the extant source images undergo a geometric alteration process to ensure their alignment with the reference image [12]. The result of this phase might contribute to the advancement of analytical procedures, such as image fusion. Image fusion involves generating enhanced and more descriptive images by utilizing the input images. Along with adding noise or distortions to the fused final image, this process may cause some crucial information from the source image to be lost. The examination and evaluation of the fusion algorithm are necessary and can be conducted using quantitative fusion metrics and qualitative visual inspection.

A system for image fusion
There are two main types of image fusion systems: single-sensor and multi-sensor fusion systems.

Single sensor fusion system
A single, high-quality output image is produced by combining the data from several input images in a scene-specific order using a single sensor in a single-sensor fusion system [13]. The concept of 'single sensor fusion' pertains to the amalgamation of data derived from several reflectance bands acquired by a single sensor, but with varying spectral resolutions [14]. One example of this is the Sentinel-2 constellation, which can acquire MS bands from the visible to short-wave infrared (SWIR) regions of the electromagnetic spectrum at resolutions of 10, 20, or 60 meters. However, under widely varying illumination and in noisy environments, human operators cannot reliably perceive the desired objects in the fused image, making this technology unsuitable for such conditions. Image resolution is also limited by sensors and operating conditions. A visible-band sensor, for instance, works best in well-lit daylight environments; however, it is not ideal in poorly illuminated nighttime situations, fog, or rain [9]. One disadvantage of using a single sensor is its limited efficiency, which can cause various issues [13].
Additionally, the sensor's capability is limited to certain settings and scenarios, resulting in reduced resolution, dynamic range, and a narrower range of working conditions [15].

Multi-sensor image fusion
The emergence and expansion of image sensors during the late 1970s facilitated the advancement of multi-sensor information fusion, leading to the establishment of image fusion as a novel branch of inquiry. This discipline focuses on utilizing images as the central subject of investigation and integrates sensor data with data processing, image evaluation, and machine learning. A multi-sensor fusion system (MSFS) overcomes single-sensor limitations [16] by fusing multiple sensor images into a single image with consistent focus points. Unlike an SSFS (single-sensor fusion system), an MSFS provides a reliable and efficient image fusion system [17,18]. Since various sensors collect data at varying temporal and spatial resolutions, combining their outputs results in a more informative image with less uncertainty, error, and noise, and improved reliability [19-21]. In the event of sensor failures or malfunctions in the performance of individual sensors, the approach can still depend on the remaining sensors present in the system [22]. For instance, unlike other sensor images such as infrared images [23], visible images [24] cannot work correctly in all weather or day/night conditions. Visible-light imagery, on the other hand, offers better spatial resolution, clarity, and texture detail than infrared imagery. Considering PAN and MS images, PAN images have low spectral resolution but high spatial resolution, while MS images have the opposite [25]. Multi-source image fusion enhances scene description [26] and supplies more useful data for subsequent image processing tasks like localization [27], segmentation [28], classification [29], diagnosis [30], surveillance [31], and agriculture [32].

The hierarchical structure of image fusion
Image fusion techniques may be roughly grouped into three broad categories: pixel-level, feature-level, and decision-level image fusion [33]. Figure 2 presents a flowchart that shows the various levels of image fusion.

Pixel-level
Pixel-level image fusion, which involves combining multiple registered source images at the pixel level, is a fundamental technique for fusing images. This fusion procedure can be executed using either the raw pixels from imaging sensors or the coefficients obtained from multi-resolution transforms. The basic goal of pixel-level image fusion approaches is to generate a visually appealing fused image while minimizing computational complexity, allowing for the seamless incorporation of data from many input images into subsequent computer-related tasks [4]. Pixel-level image fusion enables analysis, processing, and understanding in various applications, including military [34], medical diagnosis [35], remote sensing [36], photography [37], and surveillance applications [38]. It has been found, however, that fusion techniques operating at the pixel level are extremely vulnerable to issues like noise and poor registration. For pixel-level image fusion, precise registration between source images is crucial [39]. However, developing a universally appropriate image fusion approach is impractical due to the variety of source images and realistic fusion applications.

Feature-level
Feature-level fusion is the process of combining useful characteristics taken from images acquired with multi-sensor imaging [39]. This fusion occurs at an intermediate level, where the extracted features are combined to form new feature vectors that can be subsequently processed. The feature information obtained should be sufficient for the purposes of classification, collection, and synthesis of information from multiple sources. As a result, the data is better suited for real-time processing, and compression is possible without losing too much of the vital information included in the original image. Typical image features include edges, corners, lines, contours, shapes, textures, regions, etc. Therefore, feature-level techniques are used for fusion to avoid issues with pixel-based algorithms [40]. Data fusion processing relies heavily on the appropriate selection of a feature-level fusion technique suitable for the intended application [41].

Decision-level
The highest level of information fusion is called decision-level fusion, sometimes referred to as interpretation or symbol fusion [42]. It seeks to acquire decisions from each source image and merge them into the optimal overall choice based on characteristics and reliability. The fusion process combines preliminary categorization data, forming a foundation for command-and-control actions [39]. There are three primary phases that make up the overall framework [43]: the process begins with the identification and extraction of features from each source image, continues with the application of local classifiers to generate the relevant results, and concludes with the application of decision rules to fuse the results in an effort to strengthen common interpretation and understanding of objects. Examples of areas where such fusion approaches are used include remote sensing image categorization and fingerprint verification [44,45]. These methods are frequently complex and rely on heuristics, evidential reasoning, fuzzy logic, machine learning, or statistics. Each sensor achieves its decision-making objective independently before the fusion, and then an optimal outcome emerges depending on the fusion criteria and the sensors' respective credibility. Decision-level fusion is able to overcome the constraints of individual sensors because it is more efficient and reliable than pixel-level and feature-level fusion [4,41], but a large amount of information is lost during the fusion process [46].

Techniques of image fusion
Image fusion could be classified differently depending on the type of information being fused, the image sensors utilized, and the desired outcome.

Multi-view fusion
The pursuit of enhanced fusion images has sparked extensive study in the domain of image processing. 'Multi-view' or 'mono-modal' fusion refers to the act of integrating several input images from multiple perspectives, acquired at discrete points in time or over a longer period, into one unified output image. The multi-scale transformation method [47,48], the sparse representation method [49], and the hybrid method [50,51] are the traditional multi-view fusion methods, but their fusion parameters need to be tuned manually. Numerous multi-view image fusion methods, based on DeepFuse, DenseFuse, and NestFuse, aim to enhance infrared and visible fusion performance through innovative neural network models and strategies [52,53]. Emerging deep learning-based multi-view image fusion methods outperform older algorithms and allow automatic parameter adjustment.

Multi-modal fusion
The technique known as 'multi-modal image fusion' combines information from many imaging modalities (e.g. visible, infrared, MS, PAN, and remote sensing) and enhances the image quality. The objective is to integrate valuable data from various sensors into a unified view, thus eliminating redundant information. In addition to being visually appealing to the human eye, fused images are useful for various subsequent tasks [54-56]. Multi-modal image fusion finds applications in medicine, surveillance, and remote sensing [57-60]. The extraction of features is typically the first step in multi-modal image fusion methods, followed by identifying and classifying image elements to determine the most notable features. Many methods, including sparse representation (SR), dictionary learning, and subspace learning, have been proposed to extract meaningful features from images [61-63].

Multi-temporal fusion
Multi-temporal fusion involves taking images of the same scene at different times to identify changes or to create precise depictions of elements not captured at the required time. Both short-term and long-term observations are necessary to accurately assess the frequency of changes that occur on the ground. As a result of revisits by observation satellites, remote sensing images of a particular region are acquired at various times. Multi-temporal information is essential for various applications, including natural disasters (floods, earthquakes) [64,65], precision farming [66], environmental changes (glacier reduction) [67], and land cover changes [68].

Multi-focus fusion
To comply with the needs of remote sensing applications, images with high spatial and spectral resolution are required, but current technology falls short, leading to the integration of various satellite images into a single, high-resolution image. The fusion of images addresses the issue of limited focus, enhances the ease of interpretation for both humans and automated systems, and offers a more realistic depiction of the situation in some scenarios, yielding a single image with enhanced clarity. Recent years have seen a surge in research into multi-focus image fusion [69,70]. This has led to the proposal of numerous methods, including wavelet [71], curvelet [72], and discrete cosine transform [73] approaches, among others. Unfortunately, many multi-focus image fusion methods exhibit artifacts such as ringing and misregistration of boundary pixels, which dictionary-based sparse representation algorithms can overcome. Convolutional sparse representation (CSR) [74] addresses the detail-loss and misregistration-sensitivity problems of sparse representation by computing the sparse coefficients over the whole image rather than over a few discrete patches.

Remote sensing image fusion
Due to the wide variety of earth-observing satellites (such as WorldView-2 and QuickBird), remote sensing image fusion (RSIF) has become a prominent subfield in image fusion. Satellite images are either PAN images or MS images. MS images are acquired optically in multiple spectral or wavelength intervals, whereas PAN images are captured over the full visible spectrum and rendered as a single grayscale band. PAN images possess a superior spatial resolution but exhibit a limited spectral resolution. RSIF creates high-resolution MS images using spatial, spectral, and temporal data, which is more useful for a number of applications, such as lithology research, vegetation identification, and urban mapping, among others. Considering the trade-offs between resolutions, image fusion techniques can be broadly put into two groups: pansharpening, which combines fine spatial and fine spectral resolutions, and STF, which combines fine spatial and fine temporal resolutions. Schematic illustrations of these techniques are given in figure 3 and are discussed below.

Image pansharpening
Image pan-sharpening approaches are broadly divided in the literature into four distinct groups: component substitution, multiresolution analysis, super-resolution, and variational optimization (VO)-based methods. Several methods are compared and presented in table 1.

Component substitution
The component substitution method decouples the spectral and spatial information components of the MS bands by applying a predetermined transformation [75,76] to the entire image. The resulting image is then subjected to histogram matching, and the entire dataset is inverted back to the original domain. Gram-Schmidt and principal component analysis (PCA) are popular component substitution approaches [77,78]. Generalized IHS (GIHS) [79] and adaptive IHS (AIHS) [80] were proposed to produce coefficients for improved fusion outcomes in real-world situations. Such techniques have found widespread application in practice due to their great processing efficiency and low computational cost. Despite providing high-quality spatial information, component substitution-based approaches typically create spectral distortion in the pan-sharpened MS data.
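As an illustration of the component-substitution idea, the minimal sketch below (assuming co-registered NumPy arrays: ms of shape (H, W, B), already upsampled to the PAN grid, and pan of shape (H, W)) performs a GIHS-style fusion: the intensity component is taken as the band average, the PAN image is histogram-matched to it, and the resulting detail is injected into every band. The array names and the simple mean/variance matching are illustrative choices, not a specific published implementation.

```python
import numpy as np

def gihs_pansharpen(ms, pan):
    """GIHS-style component substitution (illustrative sketch).

    ms  : float array (H, W, B), MS bands upsampled to the PAN grid
    pan : float array (H, W), panchromatic band
    """
    ms = ms.astype(np.float64)
    pan = pan.astype(np.float64)

    # Intensity component approximated as the mean of the MS bands.
    intensity = ms.mean(axis=2)

    # Global histogram matching of PAN to the intensity component
    # (mean/variance adjustment) so the injected detail is radiometrically consistent.
    pan_matched = (pan - pan.mean()) * (intensity.std() / (pan.std() + 1e-12)) + intensity.mean()

    # Inject the spatial detail (matched PAN minus intensity) into every band.
    detail = pan_matched - intensity
    return np.clip(ms + detail[:, :, None], 0, None)
```

Adaptive variants such as AIHS estimate per-band injection weights instead of adding the same detail to every band, which mitigates the spectral distortion noted above.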

Multiresolution analysis
The spatial pan-sharpening issue is also addressed by another set of techniques called multiresolution analysis (MRA). First, each source image is split into high- and low-frequency subbands at distinct scales and directions. Thereafter, redundant and complementary information from the image level is combined using fusion rules chosen based on the properties of the corresponding subbands. The extracted details are then injected at the corresponding level of the decomposed MS image [75,81]. The combined subbands are subjected to inverse decomposition, which results in a unified image. MRA-based algorithms typically have good spectral performance despite the prevalence of CS-based methods [82]. These methods successfully fuse scale-decomposed images, extracting features while reducing aliasing artifacts.
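A minimal sketch of this detail-injection idea, under the same assumptions as the previous example: the PAN image is split into a low-frequency approximation and a high-frequency detail layer with a Gaussian filter (a single-level, GLP-like decomposition), and only the detail layer is injected into each upsampled MS band. The SFIM-like ratio gain is one illustrative choice among many injection rules.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def mra_pansharpen(ms, pan, sigma=2.0):
    """Single-level multiresolution (high-pass injection) fusion sketch.

    ms    : float array (H, W, B), MS bands upsampled to the PAN grid
    pan   : float array (H, W), panchromatic band
    sigma : width of the Gaussian approximating the PAN low-pass component
    """
    pan = pan.astype(np.float64)
    pan_low = gaussian_filter(pan, sigma)   # low-frequency subband
    pan_detail = pan - pan_low              # high-frequency subband

    fused = np.empty_like(ms, dtype=np.float64)
    for b in range(ms.shape[2]):
        band = ms[:, :, b].astype(np.float64)
        # Injection gain proportional to the band/low-pass ratio (SFIM-like);
        # a unit gain would give plain additive high-pass injection.
        gain = band / (pan_low + 1e-12)
        fused[:, :, b] = band + gain * pan_detail
    return fused
```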
Table 1. Summary of image pansharpening methods (technique, year, advantages, drawbacks).

Component substitution-based methods
• PCA [195], 2001. Advantages: simplified and enhanced, with superior spatial quality and less processing time required. Drawbacks: color distortion and degraded spectral characteristics.
• Brovey [196], 2002. Advantages: improvements in spatial resolution, efficiency, and processing speed. Drawbacks: spectral distortion; resolution enhancement in limited bands.
• IHS [79], 2004. Advantages: effective implementation, rapid processing, and high spatial quality. Drawbacks: unable to enhance certain image characteristics.
• Adaptive GS [197], 2007. Advantages: enhanced spatial resolution, excellent performance. Drawbacks: unable to fix spectral mismatch caused by local image discrepancies.

Multiresolution analysis-based methods
• SFIM [198], 2000. Advantages: high-speed real-time image fusion and visualization. Drawbacks: unable to demonstrate the superiority of its spectral preservation over other methods when spectral content is at stake.
• AWLP [199], 2005. Advantages: considering the sensors' physical spectral responses produces images closer to the ideal sensor's image. Drawbacks: neglecting factors such as on-orbit operating conditions and scene fluctuation may drastically affect the nominal spectral response.
• MTF-GLP [200], 2006. Advantages: effortless compatibility with the sensor's MTF. Drawbacks: lower sharpness and spatial enhancement due to subpixel misregistration.
• NNDiffuse [201], 2014. Advantages: preserves significant spatial and spectral features. Drawbacks: neglecting negative diffusion weights affects subpixel spectral accuracy.
• GLP-HS [202], 2015. Advantages: spectral consistency. Drawbacks: improving the spatial resolution of each band is challenging.

Super-resolution-based methods
• CS-based pansharpening [104], 2010. Advantages: efficiently reconstructs a signal with minimal data loss. Drawbacks: may produce many unpredictable deviations.
• SparseFI [105], 2012. Advantages: avoids the cost of dictionary construction. Drawbacks: quality degradation if the sparsity prior does not fit.
• CS-based spectral distortion reduction [203], 2015. Advantages: effective implementation, less memory and computation time, and preserves the spectral information. Drawbacks: challenges with sparse coding and dictionary creation; computationally complex.

VO-based methods
• P+XS [204], 2006. Advantages: preserves spectral details. Drawbacks: suffers from some blurring.
• Bayesian inference [107], 2008. Advantages: pansharpening approach with a Bayesian framework. Drawbacks: difficult to jointly characterize the result.
• TV [205], 2013. Advantages: provides noise-free results while preserving fine detail. Drawbacks: difficult real-time performance due to time complexity.
• GDF [206], 2013. Advantages: retains useful features and edges of the source images. Drawbacks: images may degrade due to improper structural constraints.
• FE [207], 2014. Advantages: estimates a spatial degradation filter. Drawbacks: introduces scene-dependent error.
• AWJDI [208], 2018. Advantages: minimizes spectral/spatial resolution loss. Drawbacks: computationally expensive.

Super-resolution
Spectral super-resolution has proven to be a prevalent framework for current RSIF applications such as MS super-resolution and hyperspectral super-resolution. The super-resolution methods used for pansharpening can be broken down into two classes: those that rely on learning for reconstruction and those that rely on compressive sensing. In recent years, learning techniques have spread to nearly all applications in remote sensing imaging, including MS pansharpening [83-85]. An autoencoder scheme [83] is adapted for the sparse denoising task by applying a deep learning-based method for pansharpening. The pansharpening neural network (PNN) solves the single-image super-resolution problem and is inspired by SRCNN [86]. As a result, CNNs have surpassed PNN as the method of choice for DL-based pansharpening [87-91] due to their increasingly deep and wide architectures, which require fine-tuning a large number of parameters during training to achieve optimal results. Thus, residual learning has been widely applied to pansharpening to mitigate gradient vanishing and explosion and speed up network convergence, but ML-based approaches lack generalization [92,93]. As a result, new network architectures and preprocessing operators are being developed and designed to enhance the generalizability of ML approaches [94,95]. There has been a lot of research on the theory of compressive sensing owing to its possible usefulness in retrieving images and signals from their compressed equivalents [96,97]. It has been shown that high-spatial-resolution hyperspectral images can be recovered from hyperspectral and LRM images using the principles of compressive sensing [98-102]. Under the premise of sparsity, it creates a powerful and efficient algorithm for solving the associated models [103].
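As a concrete (and simplified) illustration of the learning branch, the sketch below defines a three-layer convolutional network in the spirit of PNN/SRCNN: the upsampled MS bands are stacked with the PAN band and mapped to the sharpened MS bands. Layer widths, kernel sizes, and the residual connection are illustrative choices, not the published architecture.

```python
import torch
import torch.nn as nn

class SimplePansharpenCNN(nn.Module):
    """Three-layer PNN/SRCNN-style pansharpening network (illustrative)."""
    def __init__(self, ms_bands=4):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ms_bands + 1, 64, kernel_size=9, padding=4),  # feature extraction
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, kernel_size=5, padding=2),            # nonlinear mapping
            nn.ReLU(inplace=True),
            nn.Conv2d(32, ms_bands, kernel_size=5, padding=2),      # reconstruction
        )

    def forward(self, ms_up, pan):
        # ms_up: (N, B, H, W) MS bands upsampled to the PAN grid; pan: (N, 1, H, W)
        x = torch.cat([ms_up, pan], dim=1)
        # Residual learning: the network predicts the detail to add to the MS input.
        return ms_up + self.body(x)

# Training would minimize an L1/L2 loss against reference MS images generated at
# reduced scale (Wald's protocol), e.g. nn.L1Loss()(model(ms_up, pan), target).
```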

Variational optimization
Recent years have seen a surge in the study of VO-based approaches, which seek to strike an equilibrium between spatial and spectral quality by framing an optimization problem based on an assumed observation model with certain priors. As an important subclass of the pansharpening family, the VO-based technique relies on the solution of an optimization problem, which in turn necessitates two crucial steps: the design of the energy functional and the solution of the optimization problem. Sparse representation [104-106] and the observation model [107-111] are the most widely used approaches for building the energy functional. Model-based approaches obtain the energy functional by taking into account observation models that map the perfect fused image to the imperfect observations. Conversely, sparse-based methods rely on the concepts of sparse representation theory, where the fused image is modeled as a linear combination of a small number of fundamental elements taken from an overcomplete dictionary or a basis [112].
In most cases, an iterative optimization algorithm [113], such as the gradient descent algorithm [111], the split Bregman iteration algorithm [107], or the alternating direction method of multipliers (ADMM) algorithm [114,115], is used to find the optimal solution of the fusion model. Based on Bayesian fusion, Qi Wei et al suggested a fast fusion approach using the Sylvester equation (FUSE) [116] and a more robust algorithm, R-FUSE [117], the latter of which takes fewer computational operations but is unable to deal with spatial misalignment. Due to their rigorous mathematical foundation and unambiguous physical implications, VO-based approaches are superior in real-world applications; they yield high-precision results, albeit at the expense of greater calculation cost and complexity [118]. Furthermore, in practical situations, linear features like the gradient are insufficient to characterize the intricate connection between the merged PAN and MS images, hence hindering the efficiency of VO-based fusion methods.
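A toy sketch of the variational idea, under strong simplifying assumptions: the energy couples a PAN-coupling term (the band average of the fused image should match the histogram-matched PAN) with a term keeping the fused bands close to the upsampled MS observation, and is minimized by plain gradient descent. Real VO methods use more faithful observation models (blur/decimation operators, gradient coupling) and solvers such as split Bregman or ADMM.

```python
import numpy as np

def variational_pansharpen(ms_up, pan, lam=1.0, step=0.1, iters=200):
    """Toy variational fusion: minimize
        E(F) = ||mean_b(F) - pan||^2 + lam * ||F - ms_up||^2
    by gradient descent (illustrative only)."""
    F = ms_up.astype(np.float64).copy()
    pan = pan.astype(np.float64)
    B = F.shape[2]
    for _ in range(iters):
        residual = F.mean(axis=2) - pan
        # Gradient of E with respect to F (constant factors absorbed in the step size).
        grad = residual[:, :, None] / B + lam * (F - ms_up)
        F -= step * grad
    return F
```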

Dataset and comparative analysis
The dataset employed for the comparative analysis was acquired on May 30th, 2015, over Vancouver, Canada (latitude 49°15′ N, longitude 123°6′ W), using the DEIMOS-2 satellite [119]. The dataset is visually shown in figure 4. It includes an MS and a PAN image containing four bands, specifically the red, green, blue, and near-infrared (R, G, B, and NIR) bands. The PAN image measures 2000 × 2000 pixels and has a resolution of 1 m. Conversely, the chosen MS image measures 500 × 500 × 4 and has a resolution of 4 m. The fusion procedure results in the generation of a multi-band image with high spatial resolution, measuring 2000 × 2000 pixels and containing four spectral bands.
In this comparative study, various fusion methods are used, including Laplacian, wavelet, PCA, P+XS, R-FUSE, GSA, Bayesian, and CNMF. From the qualitative assessment, it is observed that GSA and PCA have high spectral distortion but better visual appearance. Wavelet and Laplacian methods provide better spectral details, but they lead to color distortion. Also, the Laplacian method may produce low spatial performance with evident blur effects, while wavelets may suffer from artifact generation. Since CNMF does not consider spatial misalignment, it presents a significant spectral distortion and produces overly smooth spatial structures. R-FUSE cannot handle large misalignments, has noticeable color distortion, and generates blurry edges. Bayesian methods also produce some spectral distortions. The P+XS method has limited spectral resolution, and some highlighted areas are lost or weakened. Common fusion evaluation metrics of remote sensing images for quantitative assessment include the following: quality index (Q2n) [120], universal image quality index (UIQI), spectral angle mapper (SAM), correlation coefficient (CC), peak signal-to-noise ratio (PSNR), root mean square error (RMSE), relative global-dimensional synthesis error (ERGAS), and relative average spectral error index (RASE) [75,121,122]. Table 2 shows the quality metrics used for the performance comparison, and the fusion results are displayed in figure 5. Higher PSNR, CC, UIQI, and Q2n values and smaller SAM, ERGAS, RMSE, and RASE values indicate greater accuracy. Better spectral quality of the merged image can be inferred from greater values of CC, UIQI, and Q2n.
For the metrics listed in table 2, SAM, ERGAS, RMSE, and RASE have an ideal value of 0: SAM measures the spectral similarity between the fused and reference images, while the remaining error metrics quantify fused-image quality, with higher values indicating distortion and lower values indicating consistency with the reference image, and they also reflect the average spectral-band performance of the fusion algorithm and the overall spectral quality of the fused image. PSNR measures the spatial reconstruction quality of each band. UIQI reflects the spectral and spatial distortions in the fused image, CC computes the similarity of spectral features between the reference and fused images, and Q2n jointly evaluates the fused image's spectral and spatial distortions.
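The sketch below computes a few of these reference-based metrics (SAM, RMSE, CC, and ERGAS) with NumPy; ref and fused are assumed to be co-registered (H, W, B) arrays at the same resolution, and the ERGAS resolution ratio (e.g. 1/4 for 1 m PAN and 4 m MS) is passed explicitly.

```python
import numpy as np

def sam(ref, fused):
    """Mean spectral angle (degrees) between reference and fused pixel spectra."""
    dot = np.sum(ref * fused, axis=2)
    norm = np.linalg.norm(ref, axis=2) * np.linalg.norm(fused, axis=2) + 1e-12
    angle = np.arccos(np.clip(dot / norm, -1.0, 1.0))
    return np.degrees(angle.mean())

def rmse(ref, fused):
    """Root mean square error over all pixels and bands."""
    return np.sqrt(np.mean((ref - fused) ** 2))

def cc(ref, fused):
    """Correlation coefficient averaged over bands."""
    ccs = [np.corrcoef(ref[:, :, b].ravel(), fused[:, :, b].ravel())[0, 1]
           for b in range(ref.shape[2])]
    return float(np.mean(ccs))

def ergas(ref, fused, ratio):
    """Relative global dimensional synthesis error; ratio = fine/coarse pixel-size ratio."""
    band_terms = []
    for b in range(ref.shape[2]):
        band_rmse = np.sqrt(np.mean((ref[:, :, b] - fused[:, :, b]) ** 2))
        band_terms.append((band_rmse / (ref[:, :, b].mean() + 1e-12)) ** 2)
    return 100.0 * ratio * np.sqrt(np.mean(band_terms))
```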

Spatiotemporal image fusion
Studies of earth system dynamics rely heavily on high-quality remote sensing data for tasks like crop monitoring and estimation [123-125], atmosphere monitoring [126], identification of variations in land cover and use [127], and ecosystem monitoring [128]. While advances in remote sensing have led to more precise satellite observations, this progress is limited by factors such as the technological capabilities of satellite sensors and the financial resources allocated for satellite launches. Therefore, acquiring information with a high spatial and temporal resolution can be challenging [129,130]. To quickly monitor the surface or atmospheric environment, high-resolution, time-stamped remote sensing images are needed [131]. In the past two decades, considerable progress has been made in spatial-temporal fusion techniques, facilitating the creation of satellite images with better spatial and temporal resolution by fusing coarse and fine images. The majority of review studies categorize current approaches into multiple groups based on a survey of the literature [131]. Weighting function-based algorithms anticipate an image by weighting surrounding pixels that match certain similarity requirements. To deal with uncertainty in the input images, Bayesian-based algorithms are created in accordance with Bayesian estimation theory. Unmixing-based linear mixture models can be used to estimate the fine target pixel reflectance. Learning-based strategies have garnered significant interest and are seeing substantial growth. A variety of STF techniques are summarized in table 3.

Spatial weighting-based method
The premise of weighting methods, which operate in the spatial domain, is that pixels of the same class or that are in close proximity have similar values. Once identical pixels from reference images have been identified, it is possible to use their pixel values to estimate the predicted image and weight them with the corresponding values from the low-resolution image. The spatial and temporal adaptive reflectance fusion model (STARFM) [132] involves merging neighboring pixels with comparable characteristics using weights determined by factors such as spectral difference, temporal difference, and geographical distance to generate higher-resolution images (a minimal weighted-prediction sketch in this spirit is given after table 3 below). Although the method is effective in predicting reflectance if coarse-resolution homogeneous pixels are present, it has some limitations, such as its inability to handle heterogeneous, fine-grained landscapes. The improved STARFM method [133] improves prediction in diverse landscapes by using different conversion coefficients that reflect the variations in reflectance between fine and coarse spatial resolutions. ESTARFM [134] is more computationally intensive than STARFM and cannot effectively predict short-term, transient change, leading to fuzzy borders and objects whose shape changes over time. For both temporally dynamic regions and heterogeneous regions, non-local filtering is suggested as a more reliable and accurate way to predict the target image [135]. This strategy, on the other hand, does not consistently produce reliable results. The challenge in STF is exacerbated by the existence of significant temporal changes or the unavailability of cloud-free fine spatial resolution images near the prediction period, and it can be overcome by Fit-FC [136], which entails regression model (RM) fitting. Strong temporal shifts are difficult to forecast, but they can be detected via methods such as spatial filtering. One popular method for fusing spatial and temporal information is spatial unmixing (SU) [137]. Nevertheless, the diverse range of landscapes necessitates that distinct neighboring coarse pixels make varying contributions to the core pixel. The inadequate incorporation of spatial variation in land cover within pixels poses a substantial obstacle to contemporary remote sensing technologies. To address the spatial variance in land cover and improve STF accuracy, a geographical weighting (GW)-based SU technique [137] is used; however, using nearby pixels for prediction may cause image blurring and the loss of high-frequency features.

Table 3. Summary of spatiotemporal fusion methods (technique, advantages, drawbacks).

Spatial weighting-based methods
• STARFM [132]. Drawbacks: limited performance in a heterogeneous landscape; empirical weight function.
• STAARCH [209]. Advantages: improved performance over STARFM. Drawbacks: prediction accuracy depends on the region.
• ESTARFM [133]. Advantages: excellent performance in heterogeneous regions. Drawbacks: predicting the temporal change rate necessitates two pairs of coarse/fine images.
• NLF [135]. Advantages: anticipates the target image precisely and robustly. Drawbacks: unreliable results.
• RASTFM [210]. Advantages: simple, robust, and misregistration-tolerant; considers both gradual and abrupt changes. Drawbacks: inaccurate abrupt reflectance predictions.

Unmixing-based methods
• MMT [138]. Advantages: fuses images with different resolutions. Drawbacks: lacks within-class variability and is unsuitable for time-variant locations.
• STDFA [211]. Advantages: eliminates negative and outlier problems. Drawbacks: fixed-size window.
• MSTDFA [141]. Advantages: adaptive window sizes and varying steps; needs only a single set of coarse- and fine-resolution images. Drawbacks: predicting land-cover-type variation is challenging; more computational time and degraded performance.

Learning-based methods
• STFGAN [156]. Advantages: suitable for large-area land cover with shape change. Drawbacks: blurred boundaries without change.
• STFDCNN [148]. Advantages: improved resolution and prediction accuracy. Drawbacks: insufficient prior knowledge.
• StfNet [149]. Advantages: temporal dependence predicts the unknown subtle difference images. Drawbacks: increases memory requirements, parameter counts, and training times.

Hybrid approaches
• STRUM [213]. Advantages: retains spatial and spectral properties while integrating temporal traces. Drawbacks: large computational cost; less computational efficiency compared with STARFM and Fit-FC.
• FSDAF [161]. Advantages: provides a fused image relatively swiftly; fast running speed. Drawbacks: compromised performance.
• IFSDAF [162]. Advantages: improves the resolution of images. Drawbacks: seamless spatial prediction results, leading to the loss of spatial details.
• SFSDAF [163]. Advantages: improved prediction accuracy. Drawbacks: computational burden; uncertainty in the prediction model.
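To make the weighting idea concrete, the sketch below follows the spirit of STARFM under strongly simplified assumptions: the fine (Landsat-like) and coarse (MODIS-like) images are already resampled to a common grid and treated as single bands, spectrally similar neighbours are selected inside a moving window of the fine base image, and their coarse temporal change is averaged with weights combining spectral difference and spatial distance. The window size, similarity threshold, and weight formula are illustrative, not the published STARFM rules.

```python
import numpy as np

def starfm_like_predict(fine_t1, coarse_t1, coarse_t2, win=7, spec_thresh=0.02):
    """Toy STARFM-style prediction of the fine image at t2 (single band).

    All inputs are (H, W) arrays resampled to the same (fine) grid.
    """
    fine_t1 = fine_t1.astype(np.float64)
    delta = coarse_t2.astype(np.float64) - coarse_t1.astype(np.float64)
    half = win // 2
    H, W = fine_t1.shape
    pred = fine_t1 + delta                              # fallback: apply the coarse change everywhere
    yy, xx = np.mgrid[-half:half + 1, -half:half + 1]
    dist = 1.0 + np.sqrt(yy ** 2 + xx ** 2) / half      # relative spatial-distance term
    for i in range(half, H - half):
        for j in range(half, W - half):
            fwin = fine_t1[i - half:i + half + 1, j - half:j + half + 1]
            dwin = delta[i - half:i + half + 1, j - half:j + half + 1]
            spec_diff = np.abs(fwin - fine_t1[i, j])
            similar = spec_diff <= spec_thresh          # spectrally similar neighbours
            if not similar.any():
                continue
            w = 1.0 / (dist * (1.0 + spec_diff))        # combined spectral/spatial weight
            w = np.where(similar, w, 0.0)
            pred[i, j] = fine_t1[i, j] + np.sum(w * dwin) / w.sum()
    return pred
```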

Spatial unmixing-based method
In spatial unmixing, the high-resolution image pixels are linearly fused with the endmembers of the low-resolution images. Through the process of pixel separation (unmixing), the estimation of fine pixel values becomes feasible. Early studies on unmixing-based STF methods include the multi-sensor, multi-resolution technique (MMT) [138]. It classifies a preceding fine-resolution image, assuming that the classes of the classification map blend linearly within coarse pixels. Subsequently, it unmixes the coarse pixels at the prediction time within a moving window to obtain each class's reflectance change, completing the prediction. MMT is a benchmark for unmixing-based STF algorithms; its cost function was modified [139] by incorporating a penalty term so that the solved endmembers are roughly equivalent to their corresponding predefined endmembers. Then, to improve prediction accuracy, the spatial-temporal data fusion approach (STDFA) was proposed [140], which computes reflectance differences in a moving window by accounting for both spatial variation and nonlinear temporal change similarity. To further enhance STDFA, MSTDFA [141] applies an adaptive moving window size.
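A minimal sketch of the unmixing step, assuming a fine-resolution classification map is available: the class fractions of the coarse pixels inside a window form a linear system whose least-squares solution gives one reflectance value per class, which is then assigned back to the fine pixels of each class. Window handling, constraints, and outlier screening used by real methods are omitted.

```python
import numpy as np

def unmix_window(coarse_values, class_fractions):
    """Solve for per-class reflectances inside one moving window.

    coarse_values   : (N,) reflectances of the N coarse pixels in the window
    class_fractions : (N, C) fraction of each of C classes inside every coarse pixel
                      (derived by aggregating the fine-resolution classification map)
    Returns the (C,) class reflectances (unconstrained least-squares solution).
    """
    endmembers, *_ = np.linalg.lstsq(class_fractions, coarse_values, rcond=None)
    return endmembers

def predict_fine(class_map, endmembers):
    """Assign the unmixed class reflectance to every fine pixel of that class."""
    return endmembers[class_map]
```

In unmixing-based STF such as MMT or STDFA, the same unmixing is typically applied to the coarse change image between the base and prediction dates, and the resulting per-class change is added to the fine base image.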

Bayesian-based method
Bayesian methods are adaptable replacements for conventional techniques; they account for uncertainty in both the imaging model and the predicted images. Bayesian methods have demonstrated considerable potential in the fusion of remotely sensed images in the spectral and spatial domains, so it is reasonable to investigate their application to STF. The Bayesian framework characterizes the interaction of images with different resolutions from different equipment using probability models. The ultimate objective is to obtain the required fine image by optimizing the conditional probability with respect to the coarse and fine inputs [142]. Incorporating the data fusion problem within a well-defined probabilistic framework is the key advantage of a Bayesian method. In one Bayesian-based data fusion method, information about temporal correlations is used to estimate the target image by bilinear interpolation of the coarse image with the highest a posteriori probability [143]. To address these limitations, a multi-dictionary Bayesian spatio-temporal reflectance fusion model (MDBFM) [144] was proposed, in which a Bayesian framework is employed to train multiple dictionaries from regions with varying classes.
Hypersharpening [145] is a wavelet-based technique proposed for fusing MS and HS images. A unified fusion method [146] can accomplish both spatial-spectral fusion (SSF) and STF in one process, while the NDVI-Bayesian STF model (NDVI-BSFM) [147] is used for applications including change detection, land mapping, biodiversity monitoring, and biogeochemical quantification. These applications necessitate the acquisition of NDVI products that possess a high level of spatial resolution and are obtained at regular intervals. More research is required to reduce the impact of the viewing angle and to simplify the assumption of linear additivity in space and time.
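An intentionally simple sketch of the Bayesian view: with a Gaussian prior centred on the temporally propagated fine prediction and a Gaussian likelihood for the (upsampled) coarse observation at the target date, the per-pixel maximum a posteriori estimate is a precision-weighted average of the two. Published Bayesian STF models use far richer image models (dictionaries, spatial priors, explicit point spread functions); the variable names and variances below are illustrative.

```python
import numpy as np

def bayesian_map_fuse(prior_mean, prior_var, coarse_obs, obs_var):
    """Per-pixel Gaussian MAP fusion (illustrative sketch).

    prior_mean : fine-resolution prediction propagated from the base date
    prior_var  : variance expressing confidence in that prediction
    coarse_obs : coarse observation at the target date, upsampled to the fine grid
    obs_var    : variance of the coarse observation after upsampling
    """
    w_prior = 1.0 / prior_var
    w_obs = 1.0 / obs_var
    posterior_mean = (w_prior * prior_mean + w_obs * coarse_obs) / (w_prior + w_obs)
    posterior_var = 1.0 / (w_prior + w_obs)
    return posterior_mean, posterior_var
```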

Learning-based method
Learning-based spatio-temporal image fusion (STIF) approaches are a recent and dynamically evolving area of research. There has been a growing emphasis on deep convolutional neural networks (CNNs) within the domain of spatiotemporal data fusion. STIF approaches employ several machine learning algorithms, such as deep learning [148,149], artificial neural networks [150], and extreme learning [151].
Deep learning can represent intricate temporal dynamic changes in imagery and capture spatial correlations in STF by using machine learning approaches to predict an unseen, high-resolution image. These methods execute the STF task in a general super-resolution framework without specifying the type of temporal variation [152] and have attracted increasing attention because of their satisfactory prediction performance [153,154]. By utilizing dictionary-pair learning, the sparse-representation-based spatiotemporal reflectance fusion model (SPSTFM) [155] establishes a reflectance variation relationship between coarse and fine images. Following SPSTFM, many improvements have been introduced, such as using only one pair of fine- and coarse-resolution images [156], as well as augmenting the fusion through the incorporation of structural sparsity [157-159]. SPSTFM has also been modified by integrating Gaofen-2 and Gaofen-1 data using a spatially improved training approach [152]. Deep learning techniques have produced excellent outcomes and can efficiently learn feature relationships in spatiotemporal data. Fusion accuracy and robustness are boosted by the deep convolutional STF network (DCSTFN) [153] and its enhanced version (EDCSTFN) [160], leading to more consistent and high-quality visuals. Deep learning architectures thus have the potential to significantly advance STF by learning spatial or temporal feature representations.
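The sketch below illustrates the residual-learning formulation used by several deep STF networks (in the spirit of DCSTFN/EDCSTFN, but not reproducing either architecture): the network takes the fine image at the base date together with the coarse images at both dates, and predicts the fine image at the target date as the base image plus a learned residual. Channel counts and depth are illustrative.

```python
import torch
import torch.nn as nn

class ResidualSTFNet(nn.Module):
    """Small residual CNN for spatiotemporal fusion (illustrative sketch)."""
    def __init__(self, bands=6, width=32):
        super().__init__()
        # Inputs: fine_t1, coarse_t1, coarse_t2 stacked along the channel axis.
        self.net = nn.Sequential(
            nn.Conv2d(3 * bands, width, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(width, width, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(width, bands, kernel_size=3, padding=1),
        )

    def forward(self, fine_t1, coarse_t1, coarse_t2):
        # Coarse images are assumed to be upsampled to the fine grid beforehand.
        x = torch.cat([fine_t1, coarse_t1, coarse_t2], dim=1)
        return fine_t1 + self.net(x)   # residual prediction of the fine image at t2
```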

Hybrid method
Numerous hybrid approaches have been devised through the amalgamation of diverse approaches. The flexible spatiotemporal data fusion (FSDAF) model [161] combines unmixing, weight function-based, and thin-plate spline interpolation methods to predict land cover class changes over time. Nevertheless, the achievement of both high temporal and spatial resolution simultaneously poses challenges owing to constraints imposed by technical or budgetary factors. The improved flexible spatiotemporal data fusion (IFSDAF) technique [162] incorporates constrained least squares theory into the FSDAF methodology. Sub-pixel class fraction change information (SFSDAF) [163] increases prediction accuracy by identifying image reflectance changes, especially in locations characterized by heterogeneity, when a transition in land cover class happens. The original STARFM model's weight computation was improved by including sensor observation disparities for different land cover types [164]. Even so, the fine spatial resolution of the data necessitates an unsupervised categorization step first. Though hybrid STF approaches have better generality, they employ spectral unmixing, which is susceptible to fluctuations in radiation. Due to their complexity, they are only useful for small areas.

Dataset and comparative analysis
The dataset used for the comparative analysis consists of two pairs of Landsat and MODIS surface reflectance images covering an area in Coleambally, New South Wales, Australia [165]. The Landsat images were acquired on July 2, 2013, and have a resolution of 30 m. The 500 m MODIS surface reflectance images were acquired on August 17, 2013, as 8-day synthetic products. This comparative study uses various STF methods, namely STI-FM, Fit-FC, STARFM, FSDAF, STDFA, HCM, and residual CNN, and the fusion results of these state-of-the-art methods are compared, as shown in figure 6. Various fusion evaluation metrics, including RMSE, CC, and UIQI, are employed for the analysis. The individual bands are compared using these evaluation parameters in figure 7, and the above-mentioned methods are compared using the same parameters in figure 8.
The Fit-FC approach directly fits the high-resolution image to the prediction period using the low-resolution image's linear coefficients, leading to a block effect because of the large spatial resolution disparity between the two images. STDFA fusion outcomes are influenced by the assumption of consistent temporal variation features; however, real-world applications may introduce inconsistencies. The FSDAF prediction accuracy is poor because the high- and low-resolution data have vastly different spatial resolutions, resulting in a much finer area being represented by each high-resolution pixel. Smooth fusion results are produced when there are few categories, whereas a large number of categories reduces fitting accuracy and thus increases total prediction error. Outliers interfere with STI-FM, reducing prediction accuracy when spatial factors fluctuate considerably. The heterogeneous region has a considerable impact on gradation mapping; hence, HCM did not perform well. STARFM assumes that the spectrum of identical pixels within a particular area remains stable, which provides it with reliable prediction accuracy; nevertheless, the model is subject to phenological and environmental variables because it fails to account for changes in land cover across the observation period. While the residual CNN has a high computational cost, it quickly extracts features from residual and low-resolution images and creates a mapping relationship between them using a residual learning network, improving visual effects.

Applications of remote sensing fusion
Remote sensing techniques provide crucial coverage, mapping, and classification of land cover elements such as plants, soil, water, and forests, making them useful tools for monitoring the Earth's surface and atmosphere [5]. Advances in sensor technology for high spatial and temporal resolution systems have made a plethora of satellite sensor imagery available in recent years. This data includes images with a wide variety of resolutions, acquisition times, frequencies, and polarizations. The purpose of multi-sensor data fusion is to build a unified picture that may be used to better understand the entire scene by combining information from multiple sensors. It has numerous remote sensing applications, including classification, change detection, and maneuvering target monitoring.

Classification
Remote sensing applications use numerous source images to improve image classification [26]. A fusion rule is used to combine classification results from different types of sensors, such as microwave and optical, because they each provide useful information [166]. Image fusion techniques can improve land use and land cover classifications by using complementary data that offers temporal repetitiveness or high spatial resolution. One example of land cover classification is classifying intra-urban land cover using high spatial resolution images [167]. This example leverages QuickBird images to demonstrate a case study of intra-urban land cover classification for the southeast Brazilian metropolis of Sao Jose dos Campos-SP, which encompasses an overall area of about 1,099.60 square kilometers, of which 298.99 square kilometers are urban. Urban analysis with high-resolution images is then carried out using a PCA-based fusion of a panchromatic image (0.6 m) and a multispectral image (2.4 m) with four bands (blue, green, red, and infrared); a sketch of this workflow is given below. A portion of the panchromatic, multispectral, and fused images can be seen in figures 9(a)-(c), where the decision tree approach was used for the classification. Figures 9(d) and (e) display the original multispectral image and the classification that was obtained. Sarkar et al [168] found that using data fusion techniques like artificial neural networks (ANN) and the Dempster-Shafer theory of evidence led to better performance in classifying land use and land cover than traditional methods.
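A hedged sketch of such a PCA-fusion-plus-classification workflow; the array names, the histogram-matching step, and the training-sample selection are illustrative, not taken from the cited study.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

def pca_pansharpen(ms_up, pan):
    """Replace the first principal component of the MS image with the matched PAN band."""
    H, W, B = ms_up.shape
    pixels = ms_up.reshape(-1, B).astype(np.float64)
    pca = PCA(n_components=B).fit(pixels)
    scores = pca.transform(pixels)
    pc1 = scores[:, 0]                      # sign may need flipping so it correlates positively with PAN
    pan_flat = pan.astype(np.float64).ravel()
    # Match PAN to the statistics of the first principal component before substitution.
    pan_matched = (pan_flat - pan_flat.mean()) * (pc1.std() / (pan_flat.std() + 1e-12)) + pc1.mean()
    scores[:, 0] = pan_matched
    return pca.inverse_transform(scores).reshape(H, W, B)

def classify(fused, train_x, train_y):
    """Per-pixel land-cover classification of the fused image with a decision tree.

    train_x : (n, B) labelled training spectra; train_y : (n,) class ids.
    """
    clf = DecisionTreeClassifier(max_depth=10).fit(train_x, train_y)
    H, W, B = fused.shape
    return clf.predict(fused.reshape(-1, B)).reshape(H, W)
```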

Change detection
Change detection is crucial in regulating natural resources and urban development by quantifying demographic dispersion [169]. Through the use of many platforms equipped with sensors, image fusion is able to identify subtle changes and present a broader view of how these variations have occurred over time. Changes to structures, land cover, and vegetation can be automatically recognized using high-resolution, multi-temporal, multi-spectral airborne data [170].
For change detection over the urban region of Xuzhou city, researchers utilize multi-temporal and multi-resolution QuickBird (QB) acquisitions [171]. The multispectral and panchromatic images have 2.44 m and 0.61 m spatial resolution, respectively. Figure 10 displays the 2004 and 2005 true-color composite images of the research area. Pan-sharpening improves spatial resolution while preserving spectral information, making ground objects and details more precise; hence, more minute and subtle alterations can be identified.

Tracking of maneuvering targets
The monitoring of maneuvering targets is a fundamental undertaking within the realm of intelligent vehicle development. The advancement of signal and image processing techniques, along with sensor technology, opens the door to operational automated moving-target tracking. Multi-sensor fusion, moreover, is a major area of study in autonomous robots, military applications, and mobile systems [172] because it makes tracking more efficient. Optical and SAR data are the predominant data types used in satellite remote sensing; optical images are limited by light and weather, while SAR sensors work in all weather, penetrate clouds and fog, and are not affected by shadows or time of day [173]. The divergence in optical and SAR imaging mechanisms, however, results in distinct approaches to target recognition and detection. When detecting ship targets, the sea-land border zone is identified by utilizing the disparities in reflectance between the sea and the land [174]. Figure 11 shows an example using the Yokosuka and Santiago ports [175], where the sea-land separation is achieved by randomly picking seed locations on the sea surface and then growing them regionally; a small sketch of this region-growing step is given below. Optical remote sensing images incorporate processing for sea-land separation owing to this difference in reflectance, with the sea typically exhibiting lower brightness than the land. In recent years, radar and image sensor fusion in target tracking has improved positioning accuracy and narrowed the image working area [172,176].
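A minimal sketch of seed-based region growing, assuming a single-band optical intensity image in which sea pixels are darker than land: seeds picked on the sea surface are grown to 4-connected neighbours whose intensity stays below a threshold, yielding a sea mask. The threshold and connectivity are illustrative choices.

```python
import numpy as np
from collections import deque

def grow_sea_mask(image, seeds, threshold):
    """Grow a sea mask from seed pixels (illustrative region growing).

    image     : (H, W) intensity image (sea assumed darker than land)
    seeds     : iterable of (row, col) seed positions on the sea surface
    threshold : maximum intensity accepted as sea
    """
    H, W = image.shape
    mask = np.zeros((H, W), dtype=bool)
    queue = deque(seeds)
    while queue:
        r, c = queue.popleft()
        if not (0 <= r < H and 0 <= c < W) or mask[r, c] or image[r, c] > threshold:
            continue
        mask[r, c] = True
        queue.extend([(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)])
    return mask
```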

Earth monitoring
The maritime environment is particularly challenging due to the dynamic nature of the sea and the constantly shifting sea conditions; the wide variety of objects and their appearance, camera motion, and geographic locations further complicate the detection process. In extreme weather conditions like snow, fog, or heavy rain, an IR camera can detect warm objects at night and targets better than an RGB camera, and multi-sensor fusion further improves target detection by adding redundancy in case of sensor failure. Significant in terms of long-term slope failure hazards and slope erosion processes, earthquake-induced landslides impact hundreds of square kilometers [177], and the spatial-temporal changes in vegetation after significant earthquakes, which consequently decrease the NDVI (normalized difference vegetation index), are unknown [178,179]. These spatial-temporal changes in vegetation after significant earthquakes can be captured by remote sensing satellites. Landslides in urban and forest areas inflict significant destruction upon structures and vegetation, exerting a substantial influence on both human settlements and agroforestry [180]. Cloud, water, and snow in remote sensing images reflect more strongly in the red band than in the NIR band, so their NDVI values are lower than those of vegetation, whereas rock and bare soil areas reflect similarly in both bands, so their NDVI values are close to zero. A region of Eastern Iburi, Hokkaido, Japan, that experienced landslides is illustrated in figure 12; Typhoon Jebi, a category five storm, struck on September 5, 2018, followed by a magnitude 6.56 earthquake the next day.
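The NDVI referred to above is computed directly from the red and NIR bands. The sketch below also flags vegetation loss (for example, in landslide-affected areas) as pixels whose NDVI drops by more than a chosen amount between two dates; the 0.2 threshold is an illustrative value, not one taken from the cited study.

```python
import numpy as np

def ndvi(nir, red):
    """Normalized difference vegetation index: (NIR - red) / (NIR + red)."""
    return (nir - red) / (nir + red + 1e-12)

def vegetation_loss_mask(nir_before, red_before, nir_after, red_after, drop=0.2):
    """Flag pixels whose NDVI decreased by more than `drop` between two dates."""
    return (ndvi(nir_before, red_before) - ndvi(nir_after, red_after)) > drop
```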

Precision agriculture
Image fusion in agriculture can improve productivity and efficiency by detecting crops, identifying diseases and pests, and estimating planting areas [182]. Precision agriculture is a developing concept that often requires high-resolution spatial data to monitor crop health during the growing season. This is achieved through the installation of sensors on satellites, aircraft, or ground apparatus, which are utilized in conjunction with remote sensing to gather and analyze data on the properties of crops and soil. A precise and timely estimation of crop productivity could potentially provide advantages to farmers in terms of cash flow budgeting, harvest planning, and storage requirements [183,184]. The precision farming system relies on crop monitoring technology. Multispectral remote sensing satellites evaluate crop health, growth, and possible difficulties, enabling farmers to make informed decisions about irrigation, insect management, and nutrient delivery [185]. Crop rotation can enhance soil health; however, monitoring it necessitates spatiotemporal evaluation of crop-specific management. Remote sensing facilitates planned crop rotations by tracking soil health and nutrient content. The effects of climatic conditions on crop yields in Rio Grande do Sul, southern Brazil, were examined using a satellite-based data fusion approach to map crop rotation at field scale and assess crop rotation patterns among mesoregions [186]. A satellite-based data fusion crop-type classification and mapping method extracts spatial crop attributes across four growing seasons (2017-2021), as shown in figure 13, and delineates field boundaries to create the crop rotation database. The primary grain crops cultivated in the state include soybeans, corn, and rice. Unfortunately, many satellite-based techniques lack significant field datasets for independent validation, which could limit the research.

Climate warming
Large-scale multispectral and multitemporal observation data may be obtained using remote sensing, making it a useful technique for comprehensive monitoring of glaciers, forest fires caused by climate change, and land surface phenology [187-189]. Satellite-retrieved land surface phenology (LSP) is vital for monitoring the ecological environment as well as human and societal sustainable development, since it plays a key role in energy exchange, the Earth's water cycle, and the carbon balance [190,191]. Climate change has made carbon control a major concern for ecological, environmental, and sustainable development [192]. Therefore, monitoring seasonal and annual land surface dynamics is crucial for ecological, human, and social sustainability. However, a single satellite sensor cannot produce remote sensing images with high temporal and spatial resolution for LSP extraction, and the vast size of these images limits data processing and computing [193]. Therefore, dense time-series data with high spatial resolution are needed to retrieve LSP, and this can be accomplished by blending high temporal resolution but low spatial resolution data with high spatial resolution but low temporal resolution data. In this example, Landsat and MODIS images are considered for spatiotemporal fusion. Figure 14 illustrates the spatial distribution features of the start of season (SOS) over the contiguous United States. It includes the SOS data obtained from MCD12Q2 and MOD09Q1 using the AT method with the fused EVI2 time series. The SOS values produced from the fused EVI2 in figure 14 closely align with the existing MODIS phenology products. However, there may be some data gaps due to significant cloud contamination and the limited availability of high-quality images for spatiotemporal fusion. The ESTARFM and EVI2 models accurately forecast the spatial distribution features of the red and NIR bands. The fusion of images using the proposed reference image selection rule yields results similar to the actual ones [189].
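As an illustration of how a phenological metric such as SOS can be derived from a fused time series, the sketch below computes EVI2 from red/NIR reflectance and takes SOS as the first day on which the smoothed EVI2 rises above a fraction of its seasonal amplitude. This is a generic amplitude-threshold rule in the spirit of such methods, not the specific AT algorithm of the cited study; the smoothing window and the 50% amplitude fraction are illustrative choices.

```python
import numpy as np

def evi2(nir, red):
    """Two-band enhanced vegetation index."""
    return 2.5 * (nir - red) / (nir + 2.4 * red + 1.0)

def start_of_season(evi2_series, days, window=3, amplitude_frac=0.5):
    """First day the smoothed EVI2 rises above a fraction of its seasonal amplitude.

    evi2_series : (T,) EVI2 values of one pixel through the year
    days        : (T,) day-of-year of each observation
    """
    kernel = np.ones(window) / window
    smooth = np.convolve(evi2_series, kernel, mode="same")   # simple moving average
    threshold = smooth.min() + amplitude_frac * (smooth.max() - smooth.min())
    above = np.where(smooth >= threshold)[0]
    return days[above[0]] if above.size else None
```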

Open issues and future trends
Numerous techniques for pansharpening and spatiotemporal fusion (STF) in the context of remote sensing applications have been extensively examined and evaluated. The comprehensive analysis reveals that pansharpening approaches have a substantial impact on real-time remote sensing imaging systems. Spatiotemporal data fusion represents a contemporary area of investigation within the realm of remote sensing image fusion. Below are some emerging challenges and future trends for both approaches. In the last few decades, there has been significant advancement in pansharpening, leading to the development of several techniques. Despite this, many issues still need to be addressed.
• Methods such as PCA and the Brovey transformation have low computational complexity and fast processing times, but at the expense of color distortion (a minimal Brovey sketch is given after this list).
• Wavelet-based methods reduce color distortion but are usually more computationally complex.
• Differences in spectral response between MS and PAN images produce poorly correlated fused images and hence unsatisfactory fusion performance.
• Misalignments caused by moving objects introduce spatial artifacts into the fused image.
• The CS-based methods suffer less from MS-to-PAN misalignment, which improves the spectral fidelity of the fused image to some extent.
• Some pansharpening methods developed so far achieve higher fusion accuracy but may introduce spectral distortion and low efficiency, which can hinder their application.
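As a concrete illustration of why component-substitution-style methods such as the Brovey transformation are fast but spectrally lossy, the following minimal sketch injects PAN detail by rescaling each MS band with the PAN-to-intensity ratio. It assumes the MS image has already been resampled to the PAN grid and uses an unweighted band mean as the intensity; operational implementations typically use sensor-specific band weights.

```python
import numpy as np

def brovey_pansharpen(ms, pan, eps=1e-6):
    """Brovey-transform pansharpening sketch.
    ms : (H, W, B) multispectral image already resampled to the PAN grid.
    pan: (H, W) panchromatic image.
    Each MS band is rescaled by the ratio of PAN to the MS intensity, which
    injects spatial detail but is known to distort colors."""
    intensity = ms.mean(axis=2)          # simple (unweighted) intensity component
    ratio = pan / (intensity + eps)      # detail-injection ratio
    return ms * ratio[..., None]

# Usage with random placeholder data (real inputs would be co-registered rasters)
ms = np.random.rand(64, 64, 4)
pan = np.random.rand(64, 64)
fused = brovey_pansharpen(ms, pan)
```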
To overcome the above constraints, there are some future directions.
• More effective pansharpening methods must be developed to address the disparity in spectral response between multi-spectral (MS) and panchromatic (PAN) images.
• Additionally, it is necessary to investigate pansharpening techniques that are more resilient to MS-to-PAN discrepancies.
• There is a need to develop additional application-focused pansharpening techniques that yield reliable, high-fidelity fusion outcomes.
STF methods address the challenge of generating high-quality time series data, which requires observations with both high spatial and high temporal resolution. Despite this, many issues still need to be addressed (a minimal sketch of the temporal-difference idea underlying many STF methods is given after this list).
• Predicting changes in land cover is inherently difficult because of the large discrepancy between time series with coarse and fine spatial resolution, yet most existing methods assume no land cover change.
• Disparities can arise between fine- and coarse-resolution data owing to differences in sensor quality, atmospheric conditions, and acquisition geometry; this uncertainty propagates directly into the STF results.
• In practical applications, capturing information on rapidly evolving land objects and compensating for various sensor biases (such as registration errors and differences in sensor characteristics) remains difficult. Many methods have been developed to remove such errors, but they cannot be eliminated entirely.
• Fused images may become blurred due to false spectrum prediction. For heterogeneous regions, learning-based STF methods exhibit outstanding performance and higher prediction accuracy, but they are difficult to use in practical applications because they require extensive training and substantial computing power.
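The sketch below illustrates the basic temporal-difference idea that underlies many STF methods such as STARFM: the coarse-resolution change between two dates is added to the fine-resolution image available at the first date. It is a deliberately simplified illustration that assumes perfectly co-registered, radiometrically consistent inputs on a common grid and omits the neighbourhood weighting of spectrally similar pixels used by the actual algorithms.

```python
import numpy as np

def temporal_difference_fusion(fine_t1, coarse_t1, coarse_t2):
    """Minimal spatiotemporal fusion sketch: predict the fine-resolution image
    at t2 by adding the coarse-resolution temporal change to the fine image
    at t1. All inputs are assumed co-registered on the same grid. Methods such
    as STARFM refine this with weighted neighbourhoods of spectrally similar
    pixels; that weighting is omitted here."""
    return fine_t1 + (coarse_t2 - coarse_t1)

# Placeholder arrays standing in for Landsat (fine) and MODIS (coarse) bands
fine_t1 = np.random.rand(256, 256)
coarse_t1 = np.random.rand(256, 256)   # e.g., MODIS resampled to the Landsat grid
coarse_t2 = np.random.rand(256, 256)
fine_t2_pred = temporal_difference_fusion(fine_t1, coarse_t1, coarse_t2)
```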
In order to mitigate the constraints outlined previously, several potential avenues for future exploration can be identified.
• More hybrid methods need to be developed to provide improved fusion results with enhanced accuracy.
• High-resolution images should be exploited to reconstruct sub-pixel changes in land objects.
• Reducing the computational cost of deep learning methods would make their strong fusion performance practical for wider use.
• More precise techniques must also be created to address the disparity between time series data with coarse and fine spatial resolutions.

Conclusion
Over time, there has been a growing need to combine the efficient and noteworthy attributes of several image sources, owing to the constraints and disadvantages of employing a solitary sensor for capturing a series of images. This survey provides a statistical evaluation and an in-depth look at several methods for fusing satellite images. Important findings from related papers on RSIF are treated methodically, allowing readers to investigate novel facets and research avenues in this area. The fusion of remote sensing images is further categorized into pansharpening and STF methods, and some application areas and challenges of these methodologies are also addressed.
A review of the literature in this paper shows that VO-based methods, which solve optimization problems, give accurate results because they rest on solid mathematics and have clear physical interpretations. Variants of VO-based methods such as P+XS, SIRF, ADMM, FUSE, and R-FUSE focus on preserving the spectral details of the image. Pansharpening finds widespread use in change detection and land cover classification, making it highly beneficial for remote sensing. The fusion techniques are compared through standard metrics such as PSNR, SSIM, SAM, UIQI, RMSE, and CC. Satellite imagery with high frequency and resolution is needed to track terrain heterogeneity. However, there are various constraints in this regard, including the need to balance the width of the scanning area against the size of individual pixels. Additional restrictions include clouds, cloud shadows, and various meteorological conditions. These constraints are overcome through spatiotemporal data fusion, which yields an enhanced dataset with increased spatial and temporal resolutions. The literature survey on STF shows that deep learning methods offer new ways to model complex temporal dynamics in imagery and capture spatial relationships in STF; they generate excellent outcomes and can efficiently learn feature relationships in spatiotemporal data. These fusion techniques are compared through standard metrics such as UIQI, RMSE, and CC over various bands. Therefore, many real-time remote sensing imaging systems rely heavily on the development of an effective fusion approach. This in-depth study includes substantial insights concerning numerous challenges linked to fusion techniques. Future studies in this area should concentrate on developing more hybrid methods that produce better fusion results with higher accuracy.
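For reference, the snippet below shows one plausible way to compute three of the metrics mentioned above (RMSE, CC, and SAM) between a reference image and a fused image. The exact normalizations and band handling vary across papers, so this is an illustrative sketch rather than a standard implementation.

```python
import numpy as np

def rmse(ref, fused):
    """Root-mean-square error between reference and fused images."""
    return np.sqrt(np.mean((ref.astype(float) - fused.astype(float)) ** 2))

def cc(ref, fused):
    """Pearson correlation coefficient between reference and fused images."""
    return np.corrcoef(ref.ravel(), fused.ravel())[0, 1]

def sam(ref, fused, eps=1e-12):
    """Mean spectral angle (radians) between per-pixel spectra of two (H, W, B) images."""
    dot = np.sum(ref * fused, axis=2)
    norms = np.linalg.norm(ref, axis=2) * np.linalg.norm(fused, axis=2)
    angles = np.arccos(np.clip(dot / (norms + eps), -1.0, 1.0))
    return angles.mean()

# Example on random placeholder images (a real evaluation would use co-registered rasters)
ref = np.random.rand(64, 64, 4)
fused = ref + 0.01 * np.random.randn(64, 64, 4)
print(rmse(ref, fused), cc(ref, fused), sam(ref, fused))
```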

Figure 2 .
Figure 2. A flowchart depicting the process of image fusion at different levels. (a) pixel-level image fusion; (b) feature-level image fusion; and (c) decision-level image fusion.

Figure 3 .
Figure 3. Schematic illustrations of two image fusion techniques: (a) Image Pansharpening and (b) STF.

Figure 6 .
Figure 6. Dataset for 30 m Landsat 8 (a) on July 2, 2013, and (b) on August 17, 2013. MODIS images for the dataset at 500 m (c) on July 2, 2013, and (d) on August 17, 2013. Qualitative analysis of various methods for 30 m Landsat 8 images on August 17, 2013 is shown in (e)-(k), where (e) is the Landsat 8 image obtained with Fit-FC, (f) is the STDFA-derived Landsat 8 image, (g) is the FSDAF-derived Landsat 8 image, (h) is the 30 m Landsat 8 image obtained with STI-FM, (i) is the 30 m HCM-derived Landsat 8 image, (j) is the 30 m STARFM-derived Landsat 8 image, and (k) is the residual CNN-derived Landsat 8 image. Reproduced from [165]. CC BY 4.0.
(a) and (b) show the cloud-free Sentinel-2 images acquired before and after the landslide event on May 23, 2016, and August 6, 2019, respectively, using the band combination 8-4-3. The ResU-Net is used to generate the landslide probability map shown in figure 12(c), where pixel values closer to 1 indicate a higher likelihood of the landslide class, and 1 and 2 denote two enlarged areas. The comparisons of landslide inventories of the holdout testing area with the ResU-Net probability map are shown in figure 12(d).

Figure 11 .
Figure 11. Results of ship target detection for optical and SAR images. (a) Yokosuka Base optical image detection results, (b) partially magnified view of (a), (c) fusion method detection results (displayed on the optical image), (d) partially magnified view of (c), (e) fusion method detection results (displayed on the SAR image), (f) partially magnified view of (e). Reproduced from [175]. CC BY 4.0.

Figure 12 .
Figure 12. Cloud-free Sentinel-2 images acquired (a) before (23.05.2016) and (b) after (06.08.2019) the landslide event, (c) landslide probability map from the applied ResU-Net, and (d) comparisons of landslide inventories of the holdout testing area with the probability map of the ResU-Net model. Reproduced from [181]. CC BY 4.0.

Figure 13 .
Figure 13. Maps depicting the crop rotation in Nao-Me-Toque and Cachoeira do Sul, located in Rio Grande do Sul, over four years (2017-2018, 2018-2019, 2019-2020, 2020-2021). (a)-(d) depict the crop rotation of soybean, corn, and rice over the four growing seasons, and (e) displays maps illustrating the crops grown for one, two, three, and four years. Reprinted from [186], Copyright (2023), with permission from Elsevier.

Figure 14 .
Figure 14. Spatial distribution of the start of the season (SOS) in 2017: (a) SOS obtained from MCD12Q2; (b) SOS derived from the MOD09Q1 EVI2 time series by the amplitude threshold (AT); (c) SOS derived from the Landsat-MOD09Q1 EVI2 time series by the amplitude threshold (AT); (d) SOS derived from the Landsat-MOD09Q1 EVI2 time series by the first-order derivative (FOD), predicted by the ESTARFM and the EVI2 calculated from the simulated red and NIR bands; (e) performance of ESTARFM-predicted red and NIR bands and EVI2 computed from the simulated bands. Reproduced from [189]. CC BY 4.0.

Table 2 .
Quality metrics for performance comparison and evaluation.

Table 3 .
A brief overview of prominent spatiotemporal image fusion techniques.