Few-Shot Scene Classification with Attention Mechanism in Remote Sensing

Remote sensing scene classification is an active research topic in computer vision and is of great significance to the semantic understanding of remote sensing images. At present, deep learning methods dominate this field. However, in practical application scenarios they suffer from a lack of samples and poor generalization. This paper therefore proposes a few-shot remote sensing scene classification method based on an attention mechanism and designs a dual-branch similarity measurement structure. The method uses a meta-learning training strategy to divide the dataset into tasks. At the same time, the input images are divided into blocks to preserve the feature distribution of the remote sensing image. A lightweight attention module is then introduced into the feature extraction network to reduce the risk of overfitting while ensuring that discriminative features are acquired. Finally, a dual-branch similarity measurement module is added on top of the Earth Mover's Distance to improve the discriminative ability of the classifier. The results show that, compared with classical few-shot learning methods, the proposed method significantly improves classification performance.


Introduction
In recent years, aerial remote sensing and UAV technology have developed rapidly. As an effective information carrier with rich shape, texture and scene semantic information, aerial remote sensing images are widely used in intelligence reconnaissance, environmental monitoring, natural disaster prevention, and water resource protection. However, the massive amount of high-definition remote sensing image data generated by aerial photography has exceeded the capacity of manual real-time interpretation, making it difficult to obtain relevant information in a timely and accurate manner. As one of the research hotspots in computer vision, remote sensing image scene classification [1]-[2] is of great significance to understanding the scene and semantic information contained in the data, and is an important basis for subsequent projects and military operations.
Compared with traditional image classification, remote sensing image scene classification involves complex scene information and data that are difficult to annotate. In addition, because of the characteristics of unmanned aerial vehicles and the variability of their mission scenes, some application scenes do not allow long-term collection of remote sensing image data. In practice this leads to few training samples and poor model generalization, so deep learning methods for remote sensing scene classification become harder to apply and their performance drops significantly. It is therefore necessary to construct a remote sensing image scene classification model that generalizes well from a small number of samples. Few-shot scene classification in remote sensing also faces more challenges than in natural scenes. The special characteristics of remote sensing images are mainly: (1) Interference from irrelevant objects: remote sensing images are top-down views and inevitably contain numerous objects that are irrelevant to the semantic class of the scene. (2) Large intra-class variation: remotely sensed features are complex and diverse, and remotely sensed images can be affected by a variety of imaging conditions [11]. (3) Inter-class similarity: different scene images may contain similar feature types.
In recent years, few-shot learning methods have proliferated. Currently, combining meta-learning [3]-[4] and metric learning [5]-[6] is the mainstream direction. Methods of this kind first construct few-shot tasks based on meta-learning, which enhances the generalization of the model to new tasks, and then make predictions by measuring the similarity between image features. Related algorithms include the classical Matching Network [7], Relation Network [8] and Prototypical Network [9]. In 2020, Chi Zhang et al. [10] proposed a method based on matching image region blocks, which uses the Earth Mover's Distance (EMD) to calculate the correlation between image representations.
Building on the EMD matching block, this paper proposes a few-shot remote sensing scene classification algorithm based on an attention mechanism and designs a dual-branch structure to measure the similarity of images. (1) The method uses a meta-learning training strategy to partition the dataset into tasks; (2) to preserve the distribution of remote sensing image features, the input images are cut into overlapping blocks; (3) a lightweight attention module is introduced into the feature extraction network to reduce the risk of overfitting while ensuring that discriminative features are acquired; (4) a dual-branch similarity measurement module is added to the EMD matching block to improve the discriminative ability of the classifier. Experiments on two popular remote sensing scene datasets, UCMerced_LandUse [12] and NWPU-RESISC45 [13], show that the proposed method achieves a significant improvement in classification accuracy over classical few-shot image classification methods and DeepEMD [10].

Methodology
This paper proposes a few-shot scene classification algorithm based on attention mechanism in remote sensing whose network model structure is shown in Fig. 1. Firstly, the algorithm divides the dataset into tasks based on the meta-learning strategy, and feeds the overlapped and blocked support set and query set images into a feature extraction network with shared weights. The network is embedded with a lightweight channel-space grouping attention module with good discriminative feature extraction capability. Finally, the similarity of the images in the meta-task is measured by a dual-branch discriminant based on the EMD distance, which leads to the predictions of the query set images.
Fig. 1 Framework of the proposed algorithm

Meta-learning task division
Deep neural networks are usually trained iteratively based on a specific large dataset to obtain the model parameters appropriate for the task. However, there are differences between this learning strategy and the human model of acquiring new knowledge. Humans have the ability to learn by example and use their previous learning experience to complete new tasks quickly.
Therefore, this paper adopts a meta-learning training strategy in which the training set and the test set contain different image categories, so that the model is required to generalize to new tasks. For example, in an N-way M-shot task, N categories are randomly selected from the training set and M samples are randomly selected from each of these N categories, giving a total of N×M samples that form the support set of the meta-task; the labels of the support set are assumed to be known. The test tasks are extracted by the same rule. Meta-learning training on the divided tasks allows the model to acquire prior knowledge that enables it to generalise better to tasks containing new categories.
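The task division can be illustrated with a minimal sketch. It assumes a toy dataset held as a dictionary from category name to image identifiers; the function name `sample_episode`, the `n_query` argument and the toy class names are hypothetical and only serve the illustration. The query images are drawn from the same N categories without overlapping the support images.

```python
import random

def sample_episode(labels_to_items, n_way, m_shot, n_query, rng):
    """Draw one N-way M-shot task and return (support, query) lists of (item, label)."""
    # Pick N categories at random for this meta-task.
    classes = rng.sample(sorted(labels_to_items), n_way)
    support, query = [], []
    for label in classes:
        # Draw support and query items together so they never overlap within a category.
        picked = rng.sample(labels_to_items[label], m_shot + n_query)
        support += [(item, label) for item in picked[:m_shot]]
        query += [(item, label) for item in picked[m_shot:]]
    return support, query

# Toy dataset: 8 hypothetical scene categories with 20 image ids each.
dataset = {f"class_{c}": [f"img_{c}_{i}" for i in range(20)] for c in range(8)}
support, query = sample_episode(dataset, n_way=5, m_shot=1, n_query=15, rng=random.Random(0))
print(len(support), len(query))  # 5 75
```

With 5 categories, 1 support image and 15 query images per category, one task holds 5 support and 75 query samples.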

Image overlap chunking
Remote sensing images have a unique perspective, a wide field of view, a large image size and rich and diverse features. Moreover, most remote sensing images either contain no iconic features or contain iconic features that are not located in the centre of the image. Therefore, if the whole image is fed into the feature extraction network and passed through several downsampling pooling layers, a large number of detailed features in the original image will be lost.
To address these issues, the image is cut into overlapping blocks before feature extraction. In this paper, five blocks are cut from the original image at the top-left, bottom-left, top-right, bottom-right and centre positions. This retains the discriminative features of the remote sensing image to a great extent. The cropping also serves as data augmentation, which reduces the risk of overfitting.
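The five block positions can be sketched as follows. The paper does not state the block size, so the 256×256 image and 160×160 block used here are illustrative assumptions, and `five_crop_boxes` is a hypothetical helper name. The blocks overlap whenever the block side is larger than half the image side.

```python
def five_crop_boxes(height, width, crop_h, crop_w):
    """Return (top, left, bottom, right) boxes for the four corners and the centre."""
    if crop_h > height or crop_w > width:
        raise ValueError("the block must fit inside the image")
    # Top-left origin of each of the five blocks.
    origins = {
        "top_left": (0, 0),
        "bottom_left": (height - crop_h, 0),
        "top_right": (0, width - crop_w),
        "bottom_right": (height - crop_h, width - crop_w),
        "centre": ((height - crop_h) // 2, (width - crop_w) // 2),
    }
    return {name: (t, l, t + crop_h, l + crop_w) for name, (t, l) in origins.items()}

# Assumed sizes for illustration only.
boxes = five_crop_boxes(256, 256, 160, 160)
for name, box in boxes.items():
    print(name, box)
```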

Lightweight spatial-channel attention-based feature extraction
To obtain high-dimensional abstract features of an image, a feature extraction model must first learn a practical feature embedding space in which the similarity between images can be measured. In this paper, ResNet-12 [14] is used as the backbone network for feature extraction. Each residual block establishes a cross-layer connection between its input and output, which allows the backward gradient to propagate better to the shallow layers and overcomes the problem of vanishing gradients in deep neural networks. The mathematical representation of the residual structure is shown below:
$x_{l+1} = h(x_l) + F(x_l, W_l)$, (1)
where $h(x_l)$ denotes the direct mapping of the residual block input $x_l$ and $F(x_l, W_l)$ consists of a number of convolutional layers.
Remote sensing images are rich in spatial information. At the same time, the convolution operation maps the features to a high-dimensional space, and the data channels are rich in detailed features. To overcome the "inter-class similarity" and "intra-class variability" of remote sensing images, the feature extraction network needs to be able to acquire discriminative features.
Attention mechanisms enable neural networks to focus accurately on effective elements and have become an important component in improving the performance of deep neural networks [16] [17]. Attention mechanisms are mainly divided into spatial and channel attention, which capture pixel-level and channel-level dependencies, respectively. In computer vision research, combining channel attention and spatial attention often results in performance gains. At the same time, feature extraction networks in few-shot learning are more tightly constrained in model complexity, because higher model complexity brings a higher risk of overfitting. To balance feature extraction capability against model complexity, this paper introduces a lightweight channel-space attention module based on Shuffle units into the residual structure, as shown in Fig. 2(b).
The lightweight channel-space attention based on Shuffle units [18] is implemented in four steps: feature grouping, channel attention, spatial attention, and sub-feature aggregation.
1) Feature grouping The input feature map $X \in \mathbb{R}^{C \times H \times W}$ is divided along the channel dimension into G groups, $X = [X_1, \dots, X_G]$, $X_k \in \mathbb{R}^{C/G \times H \times W}$. Each group $X_k$ is divided again along the channel dimension into two branches $X_{k1}$ and $X_{k2}$, so that the number of channels in each branch becomes $C/2G$, and the two branches are processed by channel attention and spatial attention respectively.
2) Channel attention Global average pooling integrates the global information of each channel and generates a channel statistics vector $s$ of size $C/2G \times 1 \times 1$. The global average pooling is calculated as follows:
$s = F_{gp}(X_{k1}) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_{k1}(i, j)$ (2)
The output feature $X'_{k1}$ of the channel attention module is obtained by weighting the feature map with the global statistics of the channel, where $\sigma$ is the sigmoid function:
$X'_{k1} = \sigma(F_c(s)) \cdot X_{k1}$ (3)
3) Spatial attention Spatial statistics of the image are obtained by group normalization (GN). The final spatial attention output $X'_{k2}$ is obtained by enhancing the feature representation with a fully connected layer $F_c$:
$X'_{k2} = \sigma(F_c(\mathrm{GN}(X_{k2}))) \cdot X_{k2}$ (4)
4) Sub-feature aggregation The feature vectors obtained from the two branches are concatenated, and the number of channels becomes $C/G$. Cross-group feature exchange is then achieved along the channel dimension by channel shuffle [15].
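Two of the steps above, the channel attention gate and the channel shuffle, can be sketched in plain Python. This is an illustration under assumptions rather than the authors' implementation: the gate is taken to be a sigmoid of a per-channel scale and shift applied to the pooled statistic, as in the Shuffle-unit attention of [18], and the function names and toy values are hypothetical.

```python
import math

def channel_gate(channel, weight, bias):
    """Scale one channel (a 2-D list) by sigmoid(weight * mean + bias)."""
    values = [v for row in channel for v in row]
    mean = sum(values) / len(values)                      # global average pooling
    gate = 1.0 / (1.0 + math.exp(-(weight * mean + bias)))
    return [[gate * v for v in row] for row in channel]

def channel_shuffle(channels, groups):
    """Interleave channels across groups: reshape to (groups, n), transpose, flatten."""
    if len(channels) % groups != 0:
        raise ValueError("channel count must be divisible by the number of groups")
    per_group = len(channels) // groups
    return [channels[g * per_group + i] for i in range(per_group) for g in range(groups)]

# With weight = bias = 0 the gate is sigmoid(0) = 0.5, so every value is halved.
print(channel_gate([[2.0, 4.0]], weight=0.0, bias=0.0))  # [[1.0, 2.0]]
# Eight channel indices in two groups are interleaved.
print(channel_shuffle(list(range(8)), groups=2))          # [0, 4, 1, 5, 2, 6, 3, 7]
```

The shuffle lets information from different groups mix in the next layer without adding any parameters.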
Each branch of the SA-module has $C/2G$ channels and the module has $3C/G$ parameters in total, with G typically taken as 8, 16 or 32. The SA-module is therefore a lightweight network structure that adds an attention mechanism at very little cost in model complexity.
Because remote sensing images in most cases contain no instance-level targets, distribution information is more important than a single category representation. Therefore, the remote sensing images are overlapped and segmented before being input into the network, and after feature extraction a set of local feature vectors of the input image is obtained, which forms the distribution representation of the image. Based on this, the optimal matching cost between the distributions is calculated. The smaller the matching cost, the more similar the two images.
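The idea of an optimal matching cost between two sets of local feature vectors can be shown with a simplified sketch. It is not the EMD formulation of [10]: it assumes equal weights on every block, in which case an optimal transport plan between two equally sized sets can be taken to be a one-to-one matching, and it assumes a cost of one minus cosine similarity. With five blocks per image, checking all 120 matchings by brute force is cheap. All names are hypothetical.

```python
import itertools
import math

def cosine_cost(u, v):
    """Ground cost between two non-zero vectors: 1 - cosine similarity (assumed)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def uniform_matching_cost(set_a, set_b):
    """Mean cost of the best one-to-one matching between two equally sized sets."""
    n = len(set_a)
    if n != len(set_b):
        raise ValueError("both images must have the same number of blocks")
    cost = [[cosine_cost(u, v) for v in set_b] for u in set_a]
    # Brute force over all permutations; fine for n = 5 blocks.
    best = min(
        sum(cost[i][perm[i]] for i in range(n))
        for perm in itertools.permutations(range(n))
    )
    return best / n

a = [[1.0, 0.0], [0.0, 1.0]]
b = [[0.0, 1.0], [1.0, 0.0]]   # same vectors as a, in a different block order
c = [[1.0, 0.0], [1.0, 0.0]]
print(uniform_matching_cost(a, b))  # close to 0: the sets match after reordering
print(uniform_matching_cost(a, c))  # close to 0.5: one block has no good partner
```

The example shows why a matching cost suits block-based representations: the order in which the blocks are listed does not affect the result.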

Structure of the dual-branch similarity measurement
Feature extraction yields a local feature map $h_i$, $i = 1, \dots, 5$, for each segmented image block. These feature maps must be globally pooled to obtain local feature vectors; the usual choices are global maximum pooling and global average pooling. Maximum pooling focuses on the strongest response in the feature map, while average pooling focuses on the overall feature. This paper uses global maximum pooling and global average pooling in parallel: two branches obtain the salient local feature vector and the comprehensive local feature vector of the image respectively, so that both maximum-response features and global features are considered.
In order to improve the learnability and feature characterization ability of the two branches, and to enhance the ability of the subsequent classifier to handle "intra-class variability" and "inter-class similarity" between distributed features, a channel attention-based SE module [20] is added to each of the two branches. The principle of the SE module is the same as that of the channel attention section in the previous subsection, and the structure of the dual-branch discriminative network is shown in
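The two pooling branches can be sketched for a single block as follows, assuming the feature map is stored as a list of channels, each a 2-D list of activations. The function name and the toy feature map are hypothetical.

```python
def global_pool(feature_map):
    """Return (max_vector, avg_vector), with one entry per channel in each vector."""
    max_vec, avg_vec = [], []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        max_vec.append(max(values))                 # strongest response in the channel
        avg_vec.append(sum(values) / len(values))   # overall response of the channel
    return max_vec, avg_vec

# Toy feature map with 2 channels of size 2x2.
fmap = [
    [[1.0, 3.0], [2.0, 2.0]],
    [[0.0, 0.0], [4.0, 0.0]],
]
max_vec, avg_vec = global_pool(fmap)
print(max_vec)  # [3.0, 4.0]
print(avg_vec)  # [2.0, 1.0]
```

The second channel shows the difference between the branches: a single strong activation dominates the maximum but barely moves the average.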

Metric
The metric used to evaluate the performance of the algorithm is the average classification accuracy, which is calculated as shown below:
$\mathrm{Acc} = \frac{1}{M} \sum_{i=1}^{M} \frac{r_i}{n_i}$
where M denotes the number of few-shot classification tasks, $r_i$ denotes the number of correctly classified samples in the i-th task, and $n_i$ denotes the total number of samples to be classified in that task.
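A short sketch of this metric, assuming the per-task counts of correct and total query samples are already available; the function name and the example counts are hypothetical.

```python
def average_accuracy(correct_counts, total_counts):
    """Mean over tasks of (correctly classified samples / samples to classify)."""
    if not correct_counts or len(correct_counts) != len(total_counts):
        raise ValueError("need one (correct, total) pair per task")
    ratios = [c / t for c, t in zip(correct_counts, total_counts)]
    return sum(ratios) / len(ratios)

# Three hypothetical tasks with 75 query samples each.
acc = average_accuracy([60, 45, 75], [75, 75, 75])
print(round(acc, 4))  # 0.8
```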

Training Details
The model in this paper is implemented in the PyTorch deep learning framework and run on an Nvidia Titan X (Pascal) graphics card with 12 GB of video memory.
During training and testing, the meta-tasks were generated by the N-way M-shot division rule, and each image belongs to a single category. Considering the limited computational resources, this paper uses 5-way 1-shot tasks, i.e. each task contains a total of 5 samples from the support set and 75 samples from the query set. The training parameters are set as follows: SGD is used as the optimizer with an initial learning rate of 0.0005, the learning rate is multiplied by 0.5 every 10 epochs, and training runs for 50 epochs in total. The input dimension of the network is 84×84×3, and the batch size depends on the number of support set images in each category, the number of query set images in each category, and the number of overlapping chunks cut from each image before input.
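The learning rate schedule and the task size stated above can be checked with a few lines of arithmetic. The sketch uses plain Python rather than a framework scheduler; `step_lr` is a hypothetical helper that follows the same step-decay rule as PyTorch's `StepLR` with `step_size=10` and `gamma=0.5`.

```python
def step_lr(epoch, base_lr=0.0005, gamma=0.5, step=10):
    """Learning rate at a given epoch when it is multiplied by gamma every `step` epochs."""
    return base_lr * gamma ** (epoch // step)

schedule = [step_lr(epoch) for epoch in range(50)]
for epoch in (0, 10, 20, 30, 40):
    print(epoch, schedule[epoch])

# 5-way 1-shot task with 75 query samples: 15 query images per category.
n_way, m_shot, n_query_total = 5, 1, 75
support_size = n_way * m_shot
query_per_class = n_query_total // n_way
print(support_size, query_per_class)  # 5 15
```

Over the 50 epochs the rate is halved four times, ending at one sixteenth of its initial value.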

Evaluation of Model Performance
The grouping number G is an essential hyperparameter, because the SA-module groups the feature maps. To find the best value, G is set to 8, 16 and 32 under the dual-branch discriminant structure, with UCMerced_LandUse as the dataset, to test the influence of the grouping number on the performance of the algorithm.
Fig. 7 The influence of G
As shown in Fig. 7, adding the SA-module to the feature extraction network significantly improves model performance. As G increases, the SA-module becomes smaller: it adds fewer parameters to the feature extraction network, which reduces the complexity of the model and mitigates the overfitting problem of few-shot learning. The performance of the model therefore improves as G increases. In the following experiments, the grouping number G of the SA-module is fixed at 32.
To verify the validity and reasonableness of the proposed model, ablation experiments were run on the following network structures:
1) Structure without SA-module and dual-branch similarity measurement (origin)
2) Structure with dual-branch similarity measurement which has no SE-module (Bi_withoutse)
3) Structure with dual-branch similarity measurement which has SE-module (Bi_withse)
4) SA-module only, without dual-branch similarity measurement (SA-module)
5) SA-module and dual-branch similarity measurement without SE-module (SA-module+Bi_withoutse)
6) SA-module and dual-branch similarity measurement with SE-module (SA-module+Bi_withse)
Table 1 lists the performance of each of the above experimental setups. As can be seen from Table 1, the proposed method generally improves model performance further on the UCMerced_LandUse and NWPU-RESISC45 datasets, which shows the effectiveness and reasonableness of combining the SA-module with the dual-branch discriminant module in few-shot remote sensing image classification. On the UCMerced_LandUse dataset, the best performance was obtained by the SA-module with the dual-branch similarity measurement without the SE-module; on the NWPU-RESISC45 dataset, however, the best performance was obtained by the SA-module with the dual-branch similarity measurement with the SE-module.
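The effect of G on module size can be made concrete with a small calculation, taking the parameter count of one SA-module as 3C/G. The channel count C = 640 below is a hypothetical value chosen only for illustration, since the paper does not state the channel widths of its backbone.

```python
def sa_module_params(channels, groups):
    """Parameter count 3C/G of one SA-module with C channels and G groups."""
    if channels % (2 * groups) != 0:
        raise ValueError("C must be divisible by 2G so each branch gets C/2G channels")
    return 3 * channels // groups

# Hypothetical channel count; only the trend with G matters here.
counts = {g: sa_module_params(640, g) for g in (8, 16, 32)}
print(counts)  # {8: 240, 16: 120, 32: 60}
```

Doubling G halves the number of parameters the module adds.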
Comparing the training and validation accuracy on the two datasets in Fig. 8 shows that the validation accuracy on UCMerced_LandUse is higher than on NWPU-RESISC45. However, the gap between training and validation accuracy is large on UCMerced_LandUse, which indicates a more obvious tendency to overfit. In contrast, there is little difference between training and validation accuracy on NWPU-RESISC45, from which it can be inferred that the algorithm generalizes well on NWPU-RESISC45. This is consistent with the trend in network complexity shown in Table 1: on the feature extraction backbone with the SA-module, adding the SE module to the dual-branch similarity measurement slightly reduces performance on UCMerced_LandUse and slightly improves it on NWPU-RESISC45. The difference between the two datasets is that NWPU-RESISC45 contains more scene categories, so it can be inferred that the more categories are included in the training process, the stronger the model's generalization ability to new categories.

Comparisons with Other Few-Shot Learning Methods
To verify the effectiveness of the proposed algorithm, its performance was compared with that of the following classical few-shot learning algorithms.
1) MAML The aim of MAML [21] is to obtain a good set of initialization parameters through training. Starting from these parameters, a few steps of parameter tuning on a small number of samples are enough to adapt quickly to a new task.
2) Prototypical Network The Prototypical Network maps the support set and query set samples into the feature space, calculates the centre of each category, and determines the category of each query set sample by comparing the Euclidean distance from its features to the category centres.
3) Relation Network The Relation Network maps the samples into the feature space and constructs a neural network to calculate the distance between the samples and analyse the degree of similarity accordingly.
4) DeepEMD DeepEMD maps the samples into the feature space through a neural network and uses the EMD distance as a measure of similarity.
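As an illustration of the metric-based baselines, the decision rule of the Prototypical Network can be sketched in a few lines: the centre of each category is the mean of its support features, and a query is assigned to the nearest centre in Euclidean distance. The feature vectors here are toy values rather than network outputs, and the function names are hypothetical.

```python
import math

def prototypes(support):
    """Map each label to the mean of its support feature vectors."""
    sums, counts = {}, {}
    for vec, label in support:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def classify(query_vec, protos):
    """Return the label whose prototype is closest in Euclidean distance."""
    return min(protos, key=lambda label: math.dist(query_vec, protos[label]))

support_set = [([0.0, 0.0], "a"), ([2.0, 0.0], "a"), ([10.0, 10.0], "b")]
protos = prototypes(support_set)
print(protos)                        # {'a': [1.0, 0.0], 'b': [10.0, 10.0]}
print(classify([2.0, 1.0], protos))  # a
print(classify([9.0, 9.0], protos))  # b
```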
As can be seen from Table 2, the proposed algorithm outperforms the algorithms listed above on both remote sensing datasets, which demonstrates its effectiveness in classifying remote sensing images from a small number of samples.

Conclusion
To address the problems of scarce sample data and poor model generalization in practical application scenarios, this paper proposes a few-shot remote sensing scene classification algorithm based on an attention mechanism and the meta-learning framework, and designs a dual-branch structure for similarity measurement.
Because remote sensing images often lack prominent iconic features while their feature distribution plays a key role, the input images are cut into overlapping blocks. To handle the "inter-class similarity" and "intra-class difference" of remote sensing images, a spatial and channel attention module is added to the feature extraction network, and a Shuffle unit is embedded in this attention module to avoid the increase in model complexity that attention would otherwise cause, which would aggravate overfitting in few-shot learning tasks. Finally, the dual-branch module is added to the similarity measurement block to enhance the characterisation of discriminative features. The effectiveness of this scheme was verified on two benchmark remote sensing image classification datasets.
However, current mainstream work mainly focuses on improving performance, whereas the ability to deploy few-shot remote sensing scene classification on hardware is the key to practical application. Therefore, in-depth research on the lightweight design of the model is needed in the future so that these results can be achieved on deployable hardware.