GA-based weighted ensemble learning for multi-label aerial image classification using convolutional neural networks and vision transformers

Multi-label classification (MLC) of aerial images is a crucial task in remote sensing image analysis. Traditional image classification methods have limitations in feature extraction, leading to the increasing use of deep learning models such as convolutional neural networks (CNNs) and vision transformers (ViTs). However, these models used in isolation may fall short on MLC tasks. To enhance the generalization performance of MLC of aerial images, this paper combines two CNN and two ViT models, comparing four single deep learning models, a manually weighted ensemble learning method, and a GA-based weighted ensemble method. Experimental results on two public multi-label aerial image datasets show that the ViT models outperform the CNN models, the traditional weighted ensemble model outperforms any single deep learning model, and the GA-based weighted ensemble method outperforms the manually weighted ensemble method. The GA-based weighted ensemble method proposed in this study achieves better MLC performance on aerial images than previously reported results.


Introduction
In recent years, the rise of remote sensing technology has provided a wealth of geographic images for almost every corner of the Earth's surface. Remote sensing has not only opened a door to help people understand the Earth; governments around the world have also applied it to various public services, including weather reporting, urban planning, disaster prevention, and traffic monitoring. Today, people cannot imagine life without remote sensing, making it increasingly important in global data collection tasks.
Machine learning models have been widely applied in various research fields. Deep learning theory [1] provides a new and effective way to automatically extract deep features directly from remote sensing images, making unsupervised feature learning possible from large amounts of raw image data [2]. As an emerging branch of machine learning, deep learning (DL) has seen various architectures flourish in recent years [3]. Among them, convolutional neural network (CNN) technology, covering image classification, object localization, object detection and image segmentation, has made considerable progress and has been successfully applied to remote sensing image analysis. Deep learning has been proven to be a promising technology for remote sensing image analysis and is the current main trend in remote sensing image processing [4, 5].
Although CNNs have been successfully applied to computer vision tasks and have become the dominant deep learning model, in recent years, inspired by the outstanding performance of transformer architectures in natural language tasks, some CNN-based models have attempted to capture long-range or channel-level dependencies by adding self-attention layers, or have tried to completely replace traditional convolution layers with global or local self-attention blocks. Unlike CNN models, the vision transformer (ViT) design [6] requires minimal inductive bias and is naturally suited to operating on sets. In addition, the Transformer design allows similar processing blocks to handle multiple modalities (such as images, videos, text and speech) and demonstrates excellent scalability to large-capacity networks and large datasets. These advantages have led to exciting progress in many visual tasks using ViT models [7, 8].
With the development of image-related remote sensing technologies, the spatial resolution of remote sensing images continues to improve. In medium- and low-resolution remote sensing images, different objects may share the same spectral response curve, or the same object may exhibit different spectral response curves; high-resolution remote sensing images, with resolutions of roughly 1.5 m to 4 m, usually do not have high spectral resolution. These problems impose many limitations on pixel-level or object-level classification methods. A scene is inherently a combination of multiple objects, environments, and semantics in an image. With technological development, many investigators have used scenes as basic analysis units in recent years, and supervised deep learning models have made considerable progress in remote sensing image scene classification tasks [4, 5, 9-11]. Xu et al [12] developed a multi-embedding contrastive learning framework for remote sensing image classification, which encourages the network to acquire image representations by comparing image embeddings extracted from multiple encoders and predictors. However, all of these models are single-label remote sensing scene classification methods and are usually insufficient to fully describe the content of real-world images.
In the past, remote sensing scene classification techniques mainly used the most prominent terrain features in the image as the basic interpretation unit to perform single-label scene classification, aiming to assign a semantic category to a remote sensing scene based on its visual content. However, a modern high-resolution remote sensing image usually contains rich information about land objects, and there are many differences between remote sensing images of the same scene, so single-label scene classification may oversimplify the complexity of high-resolution remote sensing scenes. This poses many challenges to the task of understanding remote sensing image scenes. Therefore, multi-label image classification (MLC) is attracting more and more attention in the field of remote sensing because it is inexpensive yet has great research potential, making the development of MLC technology a popular research direction in recent years [13-19].
Compared to single-label remote sensing scene classification, multi-label remote sensing image classification is a more realistic task. MLC predicts multiple semantic labels to describe the scene content of remote sensing images. Owing to its stronger descriptive ability, MLC can be applied to many fields, such as land cover classification [14], image retrieval [20], image segmentation [21], and image object detection [22]. Multi-label remote sensing image classification is a more difficult task because there are usually complex interactions between multiple categories. How to effectively extract distinctive semantic representation features and then effectively distinguish multiple categories is a research topic worth further exploration in the field of remote sensing [16, 18].
The literature on using deep learning models for multi-label remote sensing image classification is briefly described as follows. Stivaktakis et al [14] proposed a CNN architecture composed of three convolution + ReLU + max pooling layers, one Dense-ReLU layer, and one Dense-Sigmoid layer for the UCM multi-label dataset, using data augmentation techniques and training the model from scratch. Hua et al [23] used the pre-trained VGG-16 model as the backbone and proposed a relationship network composed of a label feature extraction module, an attention area extraction module, and a label relationship inference module for MLC tasks on the UCM and AID multi-label datasets. Sumbul and Demir [24] proposed a K-branch CNN architecture for MLC tasks on the BigEarthNet dataset, using a bidirectional long short-term memory network to implement multiple attention strategies. Qi et al [15] proposed a new multi-label remote sensing image set named MLRSNet. The MLRSNet multi-label dataset contains a total of 109 161 samples in 46 scene categories, and each image carries between one and thirteen of 60 predefined labels. Using eight fine-tuned CNN models for performance evaluation, the results show that DenseNet201 performs significantly better. Wang et al [25] combined local attention [24] and global attention pooling to capture potential relationships between multiple labels in their GC-MLFNet model, and evaluated performance on the UCM and AID multi-label datasets. Stoimchev et al [18] compared two learning strategies, end-to-end learning and a feature extractor plus tree ensembles, to classify seven remote sensing multi-label image sets. They reported that fine-tuned models perform significantly better than ImageNet pre-trained models, that EfficientNetB2 is the best model for end-to-end learning, and that EfficientNetB2 plus a random forest achieves better performance than EfficientNetB2 alone. Dimitrovski et al [19] investigated the performance of ten advanced deep learning models on seven multi-label remote sensing image sets. When evaluating model performance using the average accuracy metric, they observed that the Swin Transformer outperformed the other models in six of the seven MLC tasks. Möllenbrok et al [26] introduced a deep active learning approach for the MLC of remote sensing images. Their investigation focused on assessing the effectiveness of various active learning query functions for the MLC task in RS.
Based on the above literature review, this study aims to address three primary research questions, which are outlined below.
• Which training strategy is more efficient for MLC: how should the dataset be split? Should the parameters of the backbone network be frozen? What metric should be monitored for early stopping?
• Which type of model yields better results for MLC: CNN or ViT?
• Which approach works better for MLC: using a single model, a traditional ensemble approach, or a GA-based weighted ensemble technique?
The research questions mentioned above will be subjected to rigorous examination and analysis in this study.
The rest of this study is divided into four parts. First, we introduce the materials and methods in section 2. Next, section 3 presents the experimental results of this study. We then provide the discussion in section 4. Finally, we conclude the main findings of this paper and summarize the limitations of this study and potential directions for future research.

Materials and methods
This section is divided into three subsections to explain the datasets, methods, and evaluation metrics, as follows:

Datasets
This study uses two annotated aerial image multi-label public datasets for subsequent experiments, as explained below.
The UCM multi-label dataset [20] contains 2100 remote sensing images with a spatial resolution of 0.3 m/pixel and an image size of 256 × 256 pixels. It is divided into the following 17 categories: airplanes, bare soil, buildings, cars, forests, courts, docks, fields, grasslands, mobile homes, sidewalks, sand, sea, ships, tanks, trees, and water. Some example images and their labels are shown in figure 1.
The MLRSNet multi-label dataset [15] contains a total of 109 161 samples in 46 scene categories. Each image carries between one and thirteen of 60 predefined labels. The number of sample images per scene category ranges from 1500 to 3000. The spatial resolution of the images ranges from 0.1 m/pixel to 10 m/pixel, and the image size is 256 × 256 pixels. Some example images and their labels are shown in figure 2.

Methods
The framework of the proposed method is shown in figure 3. The following subsections explain the deep neural network models, genetic algorithms (GA), and ensemble learning methods used in this study.
To improve the MLC of aerial images, the overall framework proposed in this study is shown in figure 3. For data splitting, this study runs five repeated experiments that randomly split the image dataset into a training set (80%) and a test set (20%). For the backbone network, this study uses two newer CNN architectures and two ViT architectures, namely DenseNet201 [27], EfficientNetV2B2 [28], Swin Transformer [30] and the visual attention network (VAN) [29]. The backbone architecture excludes the fully connected layers on top of the network and uses pre-trained ImageNet weights as the initial weights. It should be noted that the Swin Transformer backbone adds a final fully connected layer with 17 nodes (the total number of labels in UCM) or 60 nodes (the total number of labels in MLRSNet), while the DenseNet201, EfficientNetV2B2, and VAN backbones first attach a global average pooling layer, followed by the fully connected layer. The activation function of the fully connected layer in each model is set to the sigmoid function. In general, the sigmoid function models each class independently and is suitable for MLC problems; for example, in an aerial image, categories such as houses, cars, and roads may coexist. The softmax function, by contrast, is applicable to single-label multi-class classification problems; in handwritten digit recognition, for instance, each image can only belong to one specific category. The sigmoid function maps each value of the network output vector to the (0, 1) interval, representing the predicted probability of each category. This study then sets a threshold of 0.5 to convert the network output into a binary vector and generate multi-label predictions, such as [1, 0, 1, …, 0], where 1 indicates that the corresponding label is present in the image and 0 indicates that it is not.
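The thresholding step described above can be sketched as follows (the probability values are illustrative only, not actual model outputs):

```python
import numpy as np

# Hypothetical sigmoid outputs of the classification head for one image,
# one probability per label (values are illustrative only).
probs = np.array([0.91, 0.12, 0.77, 0.05, 0.49, 0.63])

# Apply the 0.5 threshold described in the text to obtain the
# binary multi-label prediction vector.
pred = (probs >= 0.5).astype(int)
print(pred.tolist())  # → [1, 0, 1, 0, 0, 1]
```

Note that 0.49 falls just below the threshold, so its label is predicted absent even though the model is nearly undecided; figure 5 later examines how sensitive the F1 scores are to this choice of threshold.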
In terms of ensemble learning, this study integrates the prediction probabilities of the four deep learning models using a soft voting mechanism. In addition, to improve the generalizability of the trained model, this study embeds a GA to optimize the weight values w_i of the four DL models.

Models
The specifications of the pre-trained models used in all experiments in this paper are shown in table 1, which includes two CNNs, DenseNet201 [27] and EfficientNetV2B2 [28], and two vision transformer networks, Swin Transformer [30] and VAN [29]. The four DL models are briefly described as follows. DenseNet201 is a deep CNN model based on dense connections, proposed by Huang et al [27]. VAN [29], proposed in 2022, uses the spatial-channel attention of the large kernel attention module to capture dependencies across space and channels, achieving the adaptability and long-range relevance of self-attention. VAN has a simple hierarchical structure: a sequence of four stages that progressively reduce the output spatial resolution. Within each stage, after downsampling, the output size (spatial resolution and number of channels) of all other layers remains unchanged. It achieved a top-1 accuracy of 82.8% on the ImageNet dataset.

GAs
This study posits that if appropriate weight values can be assigned to different deep learning models, two CNNs and two ViTs can be integrated in a simple way to construct a better multi-label classifier for aerial images. To achieve this goal, this section introduces GAs to optimize the weighted ensemble values of the different models. GAs are the most well-known and widely applied algorithms among all evolutionary computation methods [31]. GAs were developed by John Holland and his students at the University of Michigan around the 1970s. Holland originally studied cellular automata and laid the groundwork for GAs while exploring dynamic adaptation between natural and artificial systems [32]. GAs are based on the theory of evolution, simulating the 'survival of the fittest' law of the biological world: species compete in their environment, and only those with strong adaptability survive and reproduce. This natural selection mechanism gradually develops the best species. In Holland's view, natural evolution occurs in the genes on biological chromosomes: the characteristics of each individual come from the gene arrangement of the previous generation, evolution refers to changes in genes between generations, and survival of the fittest means that the gene arrangement of one generation is better adapted to the environment than that of the previous generation. GAs therefore emphasize changes in genotypes, encode candidate solutions to a problem as genotypes, and apply genetic operators to evolve toward the best solution. These genetic operators simulate the evolutionary process of the biological world, including reproduction/selection, crossover, and mutation. The underlying principle is to measure candidate solutions with a fitness
function and to search for the best solution through selection, crossover, and mutation procedures. The evolutionary process of a GA begins by randomly generating a population of size N for generation 0 and evaluating each individual to obtain its fitness value [31]. Each time, two individuals are selected as parents based on fitness: individuals with higher fitness values have a greater chance of being selected, and hence multiple chances of reproducing in future generations. Parents decide whether to mate to produce offspring or simply replicate themselves based on the crossover probability, and the offspring then decide whether to mutate based on the mutation probability, where the crossover probability is much higher than the mutation probability. Mating is therefore the main evolutionary operator of GAs, and mutation a secondary one. The offspring produced after evolution, together with the surviving parents, form the population of the next generation. Figure 4 illustrates the overall process of GAs.
Because GAs search with a population composed of multiple different points, exploring multiple candidate solutions at once, they can cover a larger search space, and the solutions obtained tend to be more stable and closer to optimal. GAs have several advantages over traditional optimization methods: they are easy to use, widely applicable, and well suited to complex problems, and their multi-point search gives them a higher probability of finding global optima. GAs have been proven to be an effective technique for search and optimization problems [33-35]. In recent years, researchers have begun to apply GAs to optimize the predictions of deep neural networks. For example, Ayan et al [36] used a GA to perform a weighted ensemble of three CNNs for crop pest classification. Feng et al [37] applied GAs to construct a CNN structure and optimize the model's hyperparameter settings.
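The selection-crossover-mutation loop described above can be sketched with a small real-coded GA on a toy fitness function. All settings here (population size, operator rates, the quadratic fitness maximized at [0.5, 0.5]) are illustrative assumptions, not the configuration used in this study, which relies on the Geatpy toolbox:

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(x):
    # Toy fitness to maximize: peaks at x = [0.5, 0.5] (illustrative only).
    return -np.sum((x - 0.5) ** 2)

pop = rng.uniform(0, 1, size=(20, 2))  # generation 0: random population
for _ in range(50):
    fit = np.array([fitness(ind) for ind in pop])
    # Fitness-proportional selection of parents (shifted to be positive).
    p = fit - fit.min() + 1e-9
    p /= p.sum()
    parents = pop[rng.choice(len(pop), size=len(pop), p=p)]
    # Arithmetic crossover between consecutive parent pairs.
    alpha = rng.uniform(size=(len(pop), 1))
    children = alpha * parents + (1 - alpha) * np.roll(parents, 1, axis=0)
    # Low-probability Gaussian mutation, the secondary operator.
    mask = rng.random(children.shape) < 0.05
    children[mask] += rng.normal(0, 0.1, size=mask.sum())
    pop = np.clip(children, 0, 1)  # next generation

best = pop[np.argmax([fitness(ind) for ind in pop])]
print(np.round(best, 2))  # should land near [0.5, 0.5]
```

After a few dozen generations the population concentrates around the optimum, which is the behaviour exploited later to search ensemble weights.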

Ensemble learning
Ensemble learning is a machine learning technique that combines the predictions of multiple base models through different mechanisms to improve the accuracy and generalization ability of the model. In recent years, owing to the good results achieved by deep neural networks in various artificial intelligence applications, ensemble learning has also received widespread attention in the field of deep neural networks [38]. The main advantage of ensemble learning is that, by combining multiple base models, it can achieve better predictive performance than any single model, making it a simple and easy-to-implement model optimization strategy. The voting mechanisms of ensemble learning can be roughly divided into two types: hard voting and soft voting [39, 40]. Hard voting takes a majority vote over the category predictions of the base models to obtain the final prediction; it is simple to compute but easily affected by noise. Soft voting weights the predicted category probabilities of the base models to estimate the final prediction; it can reduce the impact of noise but has higher computational complexity. In figure 3 and equation (1), if the weight values w_i of the four DL models are set to 1 or 0, this corresponds to the traditional soft voting mechanism: after averaging the label prediction probabilities of the two CNN models and the two ViT models, the label with the highest probability is the final prediction result.
This study posits that the various base models should carry different degrees of importance for the final prediction. GAs can be used to search for the importance weight of each base model, thereby improving generalization and achieving better prediction performance. Figure 3 shows the architecture of the GA-based soft voting ensemble proposed in this study. After weighting the label prediction probabilities of the two CNN models and the two ViT models as in equation (2), the label with the largest probability is the final prediction result. Geatpy is a high-performance practical Python evolutionary algorithm toolbox [41]; this study uses the Geatpy package to build a GA that searches for the optimal weight values w_i of the four deep learning models. It should be noted that the weight values w_i in equation (2) sum to 1:

P_ens(c) = Σ_{i=1}^{4} w_i P_i(c), with Σ_{i=1}^{4} w_i = 1, (2)

where P_i(c) is the probability with which model i predicts label c.
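The weighted soft-voting step of equation (2) can be sketched as follows. The probabilities and weights below are illustrative only; in the study the weights are searched by the GA rather than set by hand:

```python
import numpy as np

# Hypothetical per-label prediction probabilities of the four base models
# for one image (rows: DenseNet201, EfficientNetV2B2, Swin, VAN;
# columns: three labels; all values are made up for illustration).
P = np.array([[0.60, 0.20, 0.80],
              [0.55, 0.30, 0.70],
              [0.90, 0.60, 0.95],
              [0.85, 0.55, 0.60]])

def weighted_vote(P, w, threshold=0.5):
    """Soft voting: weighted average of probabilities, then thresholding."""
    w = np.asarray(w, dtype=float)
    w = w / w.sum()                 # enforce that the weights w_i sum to 1
    fused = w @ P                   # weighted per-label probability
    return (fused >= threshold).astype(int)

print(weighted_vote(P, [1, 1, 1, 1]).tolist())  # equal weights → [1, 0, 1]
print(weighted_vote(P, [0, 0, 1, 1]).tolist())  # two-ViT ensemble → [1, 1, 1]
```

Setting a weight to 0 drops that model from the vote, which is how the manual configurations such as [1,1,0,0] in the experiments are realized; the GA instead searches the continuous weight space.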

Evaluation metrics
Evaluating the performance of machine learning models is an important task, and different evaluation metrics reflect, in different respects, how the model will predict on new image samples. For the performance evaluation of MLC tasks, this study calculates evaluation metrics such as the Macro mean average precision (mAP), Macro Precision score, Macro F1 score, Micro Precision score, Micro F1 score, Samples Precision score, and Samples F1 score to examine the experimental results. These evaluation metrics are calculated from the numbers of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). The precision score is the proportion of true positives among the sum of true positives and false positives, as listed in equation (3):

Precision = TP / (TP + FP). (3)
The recall score is the proportion of true positives among the sum of true positives and false negatives, as listed in equation (4):

Recall = TP / (TP + FN). (4)
The F1 score is the harmonic mean of precision and recall, as listed in equation (5):

F1 = 2 × Precision × Recall / (Precision + Recall). (5)
This study then uses the following three strategies to calculate these evaluation metrics for MLC tasks. 'Micro-average': compute the metrics globally from the total numbers of true positives, false negatives, and false positives. 'Macro-average': compute the metrics for each label and take their unweighted mean. 'Samples-average': compute the metrics for each instance and take their mean. This paper uses the Python Scikit-learn package to calculate these evaluation metrics [42].
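A minimal illustration of the three averaging strategies using Scikit-learn (the tiny ground-truth and prediction matrices are fabricated for demonstration):

```python
import numpy as np
from sklearn.metrics import precision_score, f1_score

# Made-up multi-label ground truth and predictions: 3 images × 4 labels.
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 1, 0],
                   [1, 1, 0, 1]])
y_pred = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 0, 0, 1]])

# 'micro' pools TP/FP/FN globally, 'macro' averages per label,
# 'samples' averages per image, as described in the text.
for avg in ("micro", "macro", "samples"):
    p = precision_score(y_true, y_pred, average=avg, zero_division=0)
    f = f1_score(y_true, y_pred, average=avg, zero_division=0)
    print(f"{avg}: precision={p:.3f}, f1={f:.3f}")
```

Here every predicted label is correct (no false positives), so all three precision variants equal 1.0, while the F1 variants differ because each averaging strategy distributes the two missed labels differently.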

Results
This section is divided into four subsections, which describe the experimental settings and the results of three experiments: the effects of training strategies, the effects of model architecture, and the effects of ensemble learning.The details are as follows:

Experimental settings
The training equipment used in this study is based on the Windows 11 operating system. The hardware configuration uses an Intel® Core™ i9-11900K 3.5 GHz processor with 64 GB of RAM, and an NVIDIA GeForce RTX 4090 24 GB graphics processor is used for deep learning computations. The software is developed in Python using Anaconda 1.9.0 (Python 3.8.12). The TensorFlow 2.8 platform is used for deep learning model training, and the CUDA version is 11.3.1.
For the UCM dataset, 80% of the data is randomly selected for training, and the remaining 20% is used for testing. Because of the large number of MLRSNet images (109 161) and hardware limitations, this study randomly selects 20% of that data for training and uses the remaining 80% for testing. To obtain a less biased evaluation, each experiment is repeated over five random training/testing splits, and the mean and standard deviation of the five results are calculated. Since the last layer of each model is a sigmoid output representing the predicted probability of each category label, this study sets a threshold of 0.5 to derive the corresponding predicted category labels. Table 2 lists the hyperparameter settings of the four deep network models.

The effects of training strategies
This study first explores how different training strategies affect the generalization performance of deep learning models when the multi-label image set is limited in size. To eliminate the randomness of the data split, this study uses the same training/validation/test split as Dimitrovski et al [19] for the subsequent analysis. To reduce training costs, this study does not use data augmentation techniques.
Using the Swin Transformer for the MLC task on the UCM aerial image set, and evaluating with five performance indicators (mAP, Macro Precision, Macro F1, Micro Precision, Micro F1), table 3 lists the generalization performance on the test set under eight combinations of training strategies. Bold values in the table represent locally better results, and underlined values represent the overall best results. From the data in the table, we can draw three findings. (1) When applying early stopping, monitoring the validation set accuracy (val_binary_accuracy) yields better overall test performance than monitoring the validation set loss (val_loss). When the dataset is only split into a training set and a test set, monitoring the training set accuracy (binary_accuracy) is better than monitoring the training set loss (loss) and yields better overall performance. (2) When the dataset is divided into training/validation/test sets, retraining all network weights yields better test performance than freezing the backbone weights; but when the dataset is only split into training/test sets, freezing the backbone weights is better than retraining all weights. (3) When the dataset is only split into training/test sets, using the training set accuracy (binary_accuracy) as the early stopping criterion produces the best test performance. A likely reason is that when data are limited, merging the validation set into a larger training set allows the model to learn characteristics that better match the population distribution, so the model generalizes better.
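The early stopping behaviour compared above can be illustrated with a minimal monitor in plain Python. This is a sketch of the mechanism only; a TensorFlow setup such as the one in this study would typically use the Keras EarlyStopping callback with the monitor names given above, and the patience value and accuracy trace here are made up:

```python
class EarlyStopper:
    """Minimal early-stopping monitor: signals a stop when the monitored
    quantity has not improved for `patience` consecutive epochs.
    mode='max' mirrors monitoring binary_accuracy, as in the text."""
    def __init__(self, patience=3, mode="max"):
        self.patience, self.mode = patience, mode
        self.best, self.wait = None, 0

    def update(self, value):
        improved = (self.best is None or
                    (value > self.best if self.mode == "max" else value < self.best))
        if improved:
            self.best, self.wait = value, 0
        else:
            self.wait += 1
        return self.wait >= self.patience   # True → stop training

# Hypothetical per-epoch training accuracies.
stopper = EarlyStopper(patience=2)
history = [0.80, 0.85, 0.86, 0.86, 0.85, 0.84]
for epoch, acc in enumerate(history):
    if stopper.update(acc):
        print(f"stop at epoch {epoch}, best={stopper.best}")
        break
```

Monitoring a loss instead would use mode='min'; the table 3 comparison amounts to swapping which quantity feeds `update`.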
Based on the experimental conditions of this study, including freezing the backbone network weights and monitoring training set accuracy as the early stopping criterion, figure 5 shows the impact of various sigmoid thresholds on the three F1 scores in the MLC task on the UCM image set. The results indicate that, whether we consider the Micro F1, Macro F1, or Samples F1, setting the threshold to 0.5 yields the best performance.
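The kind of threshold sweep behind figure 5 can be sketched as follows; the probabilities and labels are fabricated for illustration, so the numbers do not correspond to figure 5:

```python
import numpy as np

# Made-up probabilities and ground truth for 4 images × 3 labels.
probs = np.array([[0.9, 0.4, 0.7],
                  [0.2, 0.8, 0.6],
                  [0.6, 0.3, 0.1],
                  [0.7, 0.6, 0.2]])
y_true = np.array([[1, 0, 1],
                   [0, 1, 1],
                   [1, 0, 0],
                   [1, 1, 0]])

def micro_f1(y_true, y_pred):
    """Micro F1: pool TP/FP/FN over all labels and images."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return 2 * tp / (2 * tp + fp + fn)

for t in (0.3, 0.5, 0.7):
    print(t, round(micro_f1(y_true, (probs >= t).astype(int)), 3))
# → 0.3 0.875
#   0.5 1.0
#   0.7 0.727
```

A low threshold trades extra false positives for recall, a high one the reverse; in this toy data, as in the study's experiments, 0.5 is the sweet spot.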

The effect of model architecture
For the MLC task on the UCM image set, table 4 compares the test performance of the four transfer learning backbone networks. Among them, DenseNet201 and EfficientNetV2B2 are CNNs, while Swin and VAN are ViTs. Bold values in the table represent the best results. Across the five performance metrics, the Swin model clearly has the best test performance, the VAN

The effect of ensemble learning
For the MLC task on the UCM image sets, table 6 lists the test performance obtained using manually set ensemble weight values. The four values in the first column represent the ensemble weights of the DenseNet201, EfficientNetV2B2, Swin, and VAN models. For example, [1,1,0,0] represents equal weighting of the prediction probabilities of the two CNNs, DenseNet201 and EfficientNetV2B2; [0,0,1,1] represents equal weighting of the two ViTs, Swin and VAN; and [1,1,1,1] represents equal weighting of all four models. Bold values in the table represent the best results. The results show that the manually weighted ensemble of the two ViTs obtains the best performance in the mAP, Macro precision, Macro F1, and Samples F1 metrics, while the equal-weight ensemble of all four models only obtains better performance in the Samples Precision score. Thus the ensemble of the two ViTs performs best, the ensemble of the four models second, and the ensemble of the two CNNs worst; notably, the four-model ensemble is worse than the two-ViT ensemble alone. This study attributes this to the fact that the manually set weight values have not been optimized, so the next section introduces a real-coded GA (RCGA) to optimize the ensemble weights.
For the MLC task on the UCM image sets, table 7 lists the test results obtained using different objective functions to optimize the ensemble weights with the RCGA. The four values in the first column represent the weights of the DenseNet201, EfficientNetV2B2, Swin, and VAN models. For example, [0,0,1,0] represents the Swin model alone, which is also the best single-model result; see table 5 for details. [1,1,1,1] represents the manually set equal weighting of the four models. In table 7, the third to seventh rows represent the optimization results under five maximization objectives (mAP, Macro precision, Macro F1, Samples precision, and Samples F1) using GA-based weighted ensemble learning. Bold values represent the best results. The results show that the optimal ensemble weights of the four models vary with the optimization objective, and that the most important model under all five objective functions is the Swin model. The DenseNet201 model carries very little weight under three objectives: mAP, Macro Precision, and Samples Precision. Under the Macro Precision and Samples Precision objectives, Swin and EfficientNetV2B2 are the two most important models.
For the MLC task on the MLRSNet image sets, table 8 presents the test results of ensemble learning. The first row shows the ensemble results with equal weighting of the two CNNs; the second row, equal weighting of the two ViTs; the third row, equal weighting of the four models; and the fourth row, the performance achieved through RCGA optimization of the weight values. Bold values indicate the best results. The results show that equal weighting of the four models yields superior Macro precision and Samples precision scores, while the GA-based ensemble model achieves better mAP, Macro F1 and Samples F1 scores. Table 9 shows some examples of inferred annotations, where labels in bold are correctly found by our methods, labels in italics are detected by our methods but not marked as valid in the ground truth, and underlined labels are ground truth labels that our methods missed. Based on the above experimental results, this study compared the performance of the six deep learning models on the two multi-label test datasets using the Macro F1 score as the evaluation metric; the results are plotted in figure 6. Figure 6 clearly shows that the GA-based weighted ensemble method proposed in this study performs best, followed by the manual weighted ensemble method. The Swin model ranks third, the VAN and EfficientNetV2B2 models perform similarly, and the DenseNet201 model performs worst.
The predictions of the six deep learning models for the multi-label annotations of three test examples are shown in table 9. Labels that match the ground truth are in bold, labels detected by the proposed method but absent from the ground truth are in italics, and ground-truth labels missed by the proposed method are underlined. Table 9 clearly shows the multiple labels of the three test images predicted by the six trained deep neural networks. Most labels can be correctly annotated by any of the deep learning models, but in some cases the models differ in their ability to perceive the presence of certain objects. Overall, the GA-based weighted ensemble method proposed in this study has the best multi-label prediction performance, followed by the manually weighted ensemble method and the Swin model. Additionally, we observed that the models may fail to distinguish different objects that share common features. For example, the green color associated with the tree label makes it easy for a trained model to misreport the presence of the grass label, while objects with yellow features (such as the court, buildings, and bare-soil labels) are easily confused and misjudged by the models.

Discussions
Which training strategy is more efficient for MLC? In this study, without data augmentation, we split the dataset using a train/test approach and used the maximization of training set accuracy to monitor the early stopping criterion. The overall test performance of the trained model is superior to the results of Dimitrovski et al [19], who used a 60:20:20 training:validation:test split, random cropping and flipping for data augmentation, retraining of the backbone network weights, minimum validation loss as the early stopping condition, a batch size of 128, a learning rate scheduler choosing between 0.01, 0.001, and 0.0001, and the RAdam optimizer without weight decay. Our training strategy uses no data augmentation, freezes the backbone network weights, uses the highest training accuracy as the early stopping condition, and employs a batch size of 24 and the Adam optimizer with weight decay, a learning rate of 0.0001, and a decay rate of 0.000 001. We believe the superior generalization ability of our trained model comes from learning the feature distribution of a larger training dataset (1680 images) while monitoring training accuracy, compared with learning from a smaller training dataset (1285 images) while monitoring the validation loss on a small validation set (395 images). Therefore, when the amount of collected data is limited, this study recommends adopting a train/test split and using the maximization of training set accuracy to monitor the early stopping criterion when training deep neural networks.
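The early stopping criterion above can be sketched framework-agnostically: stop when training accuracy has not set a new maximum for a number of consecutive epochs. The patience value, `min_delta` tolerance, and the accuracy trace below are assumptions for illustration, not values from the study.

```python
class MaxAccuracyEarlyStopping:
    """Stop when training accuracy has not improved for `patience` epochs."""

    def __init__(self, patience=10, min_delta=1e-4):
        self.patience = patience
        self.min_delta = min_delta
        self.best = -float("inf")
        self.wait = 0

    def update(self, train_acc):
        """Feed one epoch's training accuracy; return True when training should stop."""
        if train_acc > self.best + self.min_delta:
            self.best = train_acc  # new maximum: reset the patience counter
            self.wait = 0
        else:
            self.wait += 1
        return self.wait >= self.patience

# usage with a hypothetical per-epoch accuracy trace
stopper = MaxAccuracyEarlyStopping(patience=3)
trace = [0.60, 0.72, 0.80, 0.80, 0.79, 0.80]
stopped_at = next(i for i, acc in enumerate(trace) if stopper.update(acc))
# stops at epoch index 5: no improvement over 0.80 for 3 consecutive epochs
```

The same monitor plugs into any training loop by calling `update()` once per epoch, which is how the "highest training accuracy" condition replaces validation-loss monitoring when no validation split is held out.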
Which type of model yields better results for MLC: CNN or ViT? CNN is a classic architecture based on convolution and pooling layers; it captures local features through convolution operations and extracts high-level features by stacking multiple convolution and pooling layers. ViT is a newer architecture that uses multi-head self-attention to learn the relationships between different regions of the image and capture global information. CNN and ViT each have their own advantages, and the architecture to use depends on the specific application scenario and the available computational resources. The experimental results of this study confirm that ViT models are more suitable than CNN models for MLC tasks on aerial image sets, consistent with the findings of Dimitrovski et al [19]. Which approach is more suitable for MLC: a single model, a traditional ensemble approach, or a GA-based weighted ensemble technique? According to the experimental results of this study, the GA-based weighted ensemble method generalizes best, followed by the traditional ensemble method, with the single deep learning model last. These findings are consistent with the results of Ayan et al [36] and confirm once again that combining multiple deep learning models with appropriate weighting values yields better prediction performance than any single deep learning model or traditional ensemble method; it is also a simple and easy-to-implement optimization strategy. Additionally, based on the sizes of the weight values optimized by GA-based soft voting, the experimental results show that the most important model is the Swin model, followed by the VAN model and the EfficientNetV2B2 model, with the DenseNet201 model last. This reaffirms the superiority of ViT over CNN for MLC tasks on aerial images.
To demonstrate the superiority of the GA-based weighted ensemble method proposed in this study, table 10 compares the test performance of four recent studies (Stivaktakis et al [14], Qi et al [15], Dimitrovski et al [19], Hua et al [23]) on the MLC task of aerial images. To obtain a less biased evaluation, this study uses the average performance over five experimental runs (as in Stivaktakis et al [14] and Qi et al [15]) instead of the single performance of the holdout method (as in Dimitrovski et al [19] and Hua et al [23]). For the MLC task of the UCM aerial image set, the experimental results show that the proposed GA-based weighted ensemble method performs best on the mAP, Macro precision, Macro F1, Samples precision, and Samples F1 indicators. For the MLC task of the MLRSNet aerial image set, with a training:testing split of 2:8, the proposed GA-based weighted ensemble method outperforms the method of Qi et al [15]. It should be noted that, compared with the method of Dimitrovski et al [19], the proposed GA-based weighted ensemble method is not inferior, because Dimitrovski et al [19] used up to 6/10 of the data for training, whereas this study used only 2/10.

Conclusions
This study explores the efficiency of different training strategies, model types, and ensemble approaches for MLC in remote sensing. In terms of training strategy, the study finds that a train/test approach, with the maximization of training set accuracy as the early stopping criterion, yields superior results compared with a train/validation/test approach. The study suggests that when data is limited, it is beneficial to use a larger training dataset and monitor training accuracy for early stopping. Regarding the type of model, the study confirms that ViT models are more suitable than CNN models for MLC tasks on aerial image sets, although ViT models incur higher training costs. Regarding the ensemble approach, the study finds that the GA-based weighted ensemble method offers the best model generalization, followed by the traditional ensemble method, with a single deep learning model ranking last. The study reaffirms that combining multiple deep learning models with appropriate weighting values can yield better prediction performance than any single model or traditional ensemble method. In summary, this study highlights the superiority of the GA-based weighted ensemble method for MLC tasks on aerial images. The proposed method leverages an optimized weighted combination of two CNN and two ViT models, showing both promising prospects and practical applicability in remote sensing image analysis.
Based on the results of this study, future work can be expanded in the following areas. First, we will incorporate more new deep learning models and multi-label aerial image sets for evaluation. Next, we will consider applying the GA-based weighted ensemble method proposed in this study to different tasks, such as object detection, image segmentation, image captioning, and image retrieval. Lastly, given the abundance of unlabeled data in the field of remote sensing, we will investigate how to integrate semi-supervised learning techniques for relevant applications. All three of these future research directions may require greater computational power from hardware resources. The first two tasks are highly feasible, whereas the third is more challenging but has broader practical applicability in the field of remote sensing.
in 2017. DenseNet improves the efficiency of feature-map reuse through its Dense Blocks, reduces the number of parameters, and lowers the risk of vanishing gradients. Unlike ResNet, DenseNet passes its features to the next layer by concatenation rather than by ResNet's feature summation. On the ImageNet dataset, the DenseNet model even outperforms VGG Net and ResNet, achieving a top-1 accuracy of 77.3%. EfficientNetV2B2 is a deep CNN model built on EfficientNetV1, proposed by Tan and Le [28] in 2021. Its main feature is the combination of training-aware neural architecture search and scaling to jointly optimize model size and training speed, enabling faster training and inference. The main advantage of the EfficientNetV2B2 model is that it reduces the number of parameters and the training time while maintaining high accuracy, making it a very efficient deep learning model. It achieved a top-1 accuracy of 80.5% on the ImageNet dataset. The Swin Transformer, proposed by Liu et al [30] in 2021, is a deep neural network model based on the vision transformer architecture that can serve as a general-purpose backbone for computer vision. Its two main features are a hierarchical structure and the use of shifted windows to compute self-attention for both local and global attention; these two designs make the Swin Transformer more efficient. Like a CNN, the Swin Transformer uses a hierarchical construction (hierarchical feature maps), extracting multi-scale features with 4×, 8×, and 16× downsampling of the feature map. The Swin Transformer also uses windowed multi-head self-attention (W-MSA), dividing the feature map into multiple disjoint regions (windows) so that multi-head self-attention operates only within each window. It achieved a top-1 accuracy of 85.2% on the ImageNet dataset. VAN is a new visual transformer model proposed by Guo et al

On the other hand, this study compares the training costs of the four deep learning models in terms of computation time. On both the UCM dataset and the MLRSNet dataset, the results show that the EfficientNetV2B2 model has the shortest training time and the Swin model the longest. The training cost ratio of the Swin model to the EfficientNetV2B2 model is about 4.4 on the UCM dataset and about 7.3 on the MLRSNet dataset, indicating that the ViT architecture requires higher training costs than the CNN architecture.

Figure 3. The framework of the proposed method.

Figure 4. The flowchart of genetic algorithms.

Figure 6. Performance comparison of six models using the Macro F1 score.

Table 1.
Pretrained models used in this study.

Table 2 .
Hyperparameter setting for deep network models.

Table 3 .
Comparison of the results of different training strategies.

Table 4 .
Comparison of the MLC results of different models in UCM set.

Table 5 .
Comparison of the MLC results of different models in MLRSNet set.
model performs second best, the EfficientNetV2B2 model ranks third, and the DenseNet201 model ranks last. The final column of table 4 reports the computation time spent during the model training phase. The results show that the EfficientNetV2B2 model has the shortest training time, the Swin model requires the longest computer time to train, and the VAN model has the second-longest training time. In the MLC task of the MLRSNet image set, table 5 presents a comparative analysis of the test performance of the four transfer learning backbone networks. Bold values indicate the best results. The evaluation was carried out on five performance indicators, and the results reaffirm that the Swin model exhibits the best test performance, followed by the VAN model. The EfficientNetV2B2 model ranks third, while the DenseNet201 model is last. The final column of table 5 shows the computation time expended during the model training stage. It is evident once again that the Swin model

Table 6 .
Comparison of the MLC results of traditional ensemble learning in UCM set.

Table 7 .
Comparison of the MLC results of different ensemble learning weights in UCM set.

Table 8 .
Comparison of the MLC results of different ensemble learning weights in MLRSNet set.

Table 10 .
A comparative summary of the SOTA approaches for MLC.