Graph sampling based deep metric learning for cross-view geo-localization

Cross-view geo-localization has emerged as a novel computer vision task that has garnered increasing attention, primarily due to its practical significance in drone navigation and drone-view localization. The task is particularly demanding because of its inherent requirement for cross-domain matching. There are generally two ways to train a neural network to match similar satellite and drone-view images: representation learning with classifiers and identity loss, and metric learning with pairwise matching within mini-batches. The first incurs extra computing and memory costs in large-scale learning, so this paper follows a person re-identification method called QAConv-GS: it implements a graph sampler to mine the hardest data to form mini-batches, and a QAConv module with extra attention layers appended to compute similarity between image pairs. A batch-wise OHEM triplet loss is then used for model training. With these implementations and adaptations combined, this paper significantly improves the state of the art on the challenging University-1652 dataset.


Introduction
Cross-view geo-localization is a widely recognized computer vision task that involves identifying and matching a specific building or structure in both drone/satellite and query satellite/drone-view images [1]. In recent years, there has been an increasing focus on this topic, driven by its practical significance in the fields of drone navigation and drone-view localization. The primary obstacle in this endeavor is acquiring a feature that is resilient to significant variations in viewpoint. The model's generalizability is significantly challenged by the distinct nature of the satellite and drone images, which can be regarded as two separate datasets characterized by substantial differences in viewpoint and background.
There are, in general, two ways to train a deep neural network for this task. The first is the classification-based method with identification loss [1][2][3][4]. In this approach, two classifiers are needed during training to classify satellite and drone images respectively. The testing phase only requires measuring similarity between images, so the classifiers are dropped after training. This incurs large extra memory and computational costs during the forward and backward passes of training.
The second is to adopt deep metric learning with a triplet loss or pairwise matching loss [5][6]. This usually improves the model's discrimination ability because it provides more challenging samples for model training. Metric learning can be implemented with classification modules, pairwise matching modules, or a combination of both [6][7]. However, as stated in the previous paragraph, classifiers incur significant additional costs, so it is more advantageous to train exclusively with a pairwise matching method. This brings another benefit: classification-based training tends to make the features of same-identity images as similar as possible, but in the cross-view setting, satellite-view and drone-view images are very different even when they share an identity. In contrast, matching only between image pairs during training allows same-identity images to have different features, as long as they are closer to each other than to any negative sample. This makes model training more intuitive and thus higher-performing.
For the reasons given in the last two paragraphs, and following the recent person re-identification work QAConv-GS [8], this paper implements and modifies a deep metric learning method to solve the cross-view geo-localization task. Through comparison with recent SOTA methods and an ablation study, this paper verifies the effectiveness of the implemented modules. On the University-1652 dataset [1], the modified model reaches 97.72 Rank-1 and 93.31 mAP for satellite→drone and 93.94 Rank-1 and 95.03 mAP for drone→satellite.

Related Work
Deep metric learning methods are widely used in cross-view geo-localization and have achieved strong performance. These approaches can be roughly divided into loss function design and pairwise matching schemes. The most popular loss functions in this task are pairwise loss [6], classification loss [9], and triplet loss [5]. There is also a newly proposed loss function called instance loss [10], which has greatly improved model training. As for pairwise matching schemes, Ming et al. proposed a method to fuse and align different parts of the transformer feature [11], and Wang et al. match local areas of the feature map to measure correspondences between drone and satellite images [12]. These schemes work by matching parts of the feature map, and they save computing and memory costs by dropping the classifier during training.
In this paper, we implement a similar but more effective scheme. Following the QAConv-GS work, a matcher that measures local correspondences between whole feature maps is implemented. It compares every part of the feature map, and an attention module is added to suppress uninformative regions. Before every epoch, graph sampling is performed to mine the hardest data for training. Collectively, these methods substantially enhance the performance of the model.

Training Pipeline
To train the model for both satellite→drone and drone→satellite matching, there are two training phases: one takes satellite images as anchors, with drone images as positive and negative samples, and the other is the opposite. The training pipeline generally includes cross-view graph sampling to form mini-batches, a matching scheme to evaluate similarity between samples, and a loss function for backpropagation.

Cross-view Graph Sampling
For metric learning with pairwise matching schemes, a sampling method is needed to find positive and negative pairs for model training. These pairs are wrapped into mini-batches that are fed into the model during training. The traditionally used PK sampler works by randomly selecting P identities and K samples from each identity to form a mini-batch [13]. This random method, however, cannot always provide informative data for training. Thus, this paper implements a graph sampler to improve discrimination ability [8], as shown in Fig. 1. This sampler was originally designed for the person re-identification task, where images do not fall into two different distributions as in the cross-view matching scenario, so it needs to be adapted for this new task. Take satellite→drone as an example, and let C denote the total number of classes used for training. At the beginning of every epoch, for each identity, we randomly select one satellite-view image and one drone-view image to represent the whole identity. Then the latest trained model is used to extract the features of the selected images, denoted F_sat ∈ R^(C×D) and F_dro ∈ R^(C×D), where D is the number of feature dimensions. By computing the similarity between these features, a similarity matrix S ∈ R^(C×C) yields the top P−1 most similar drone-view identities for each satellite-view identity. After that, a mini-batch is formed by randomly taking K images from a satellite identity, plus K images from the same drone identity as positive samples, and K images from each of its P−1 neighboring drone identities as negative samples. To provide more sufficient data, this paper also randomly takes K images from the corresponding P−1 satellite identities. Together, the size of the mini-batch is B = 2 × P × K.
The same process also works for drone→satellite training, by simply taking K images from a drone identity and its P−1 neighboring satellite identities. Because these neighboring identities contain samples from different classes that look the most similar to each other, applying graph sampling twice lets the model train with the most difficult image pairs, making it more discriminative in both satellite→drone and drone→satellite matching.
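As a concrete illustration, the neighbor-mining step of the graph sampler can be sketched in NumPy. This is a minimal sketch, not the official implementation: the function name `hardest_neighbours` and the use of cosine similarity on L2-normalized per-identity features are assumptions for illustration.

```python
import numpy as np

def hardest_neighbours(feat_sat, feat_dro, anchor, P):
    """Return the P-1 drone-view identities most similar to the
    anchor satellite identity.

    feat_sat, feat_dro: (C, D) arrays holding one L2-normalized
    representative feature per identity; rows index identities.
    """
    # Similarity matrix S in (C x C); with normalized rows this is
    # cosine similarity between every satellite/drone identity pair.
    S = feat_sat @ feat_dro.T
    sims = S[anchor].copy()
    sims[anchor] = -np.inf             # the anchor's own id is the positive, not a negative
    return np.argsort(-sims)[: P - 1]  # indices of the P-1 hardest negatives

# Toy example: 4 identities as unit vectors at 0, 10, 20 and 90 degrees.
angles = np.deg2rad([0.0, 10.0, 20.0, 90.0])
feats = np.stack([np.cos(angles), np.sin(angles)], axis=1)
neigh = hardest_neighbours(feats, feats, anchor=0, P=3)
# Identities 1 and 2 lie closest to identity 0, so they are mined as negatives.
```

A full mini-batch would then draw K images from the anchor identity and from each mined neighbor, on both views, giving B = 2 × P × K images.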

QAConv with Attention
To accurately calculate the similarity between two images, a robust matching module is required. This paper implements the person re-identification matching scheme QAConv [7] and improves its performance by attaching an attention module.
For a query image, QAConv works by replacing the fixed convolution kernels in the neural network with local patches of a gallery image's feature map. In practice, each image is fed into a feed-forward neural network, resulting in a feature map of size 1 × D × H × W, where D is the number of dimensions and H and W are the height and width of the feature map. The gallery feature map is then divided into local patches of size S × S and reorganized into convolution kernels of size HW × D × S × S. After convolving the query feature map with these kernels, a similarity matrix of size 1 × HW × H × W is produced as the result.
The similarity matrix reflects the detailed matching results at every location of the feature map. To transform it into a similarity value, global maximum pooling is performed over the last two dimensions of the similarity matrix to find the best local correspondences, giving a 1 × HW vector. This vector then goes through a BN-FC-BN block, which outputs the final value indicating whether the two images belong to the same identity.
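The correlation-and-pooling step above can be sketched in NumPy. For simplicity, this sketch assumes patch size S = 1, so each gallery location is a 1 × 1 "kernel" and the convolution reduces to dot products; the function name and the toy shapes are illustrative, not the official implementation.

```python
import numpy as np

def qaconv_scores(fq, fg):
    """Match a query feature map fq against a gallery feature map fg,
    both of shape (D, H, W), with S = 1 patches.

    Each gallery location serves as a convolution kernel over the
    query map, giving an (HW, H, W) similarity volume; global max
    pooling over the last two axes keeps the best correspondence for
    each kernel, yielding the 1 x HW vector fed to the BN-FC-BN head.
    """
    D, H, W = fq.shape
    q = fq.reshape(D, H * W)          # query locations as columns
    g = fg.reshape(D, H * W).T        # gallery locations as (HW, D) kernels
    M = (g @ q).reshape(H * W, H, W)  # similarity volume, 1 x HW x H x W
    return M.max(axis=(1, 2))         # global max pooling -> (HW,)

# Tiny 2x2 example with 2 channels; matching a map against itself
# produces one strong best-correspondence score per location.
fq = np.array([[[1., 0.], [1., 0.]],
               [[0., 1.], [1., 0.]]])
scores = qaconv_scores(fq, fq)
```

With S > 1 the same idea applies, except each kernel covers an S × S neighborhood instead of a single location.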
However, as a previous study reveals, feature maps often contain uninformative content that does not contribute to classification or matching. The same holds for the similarity matrix M of size 1 × HW × H × W, since it is the result of convolution between two feature maps. So instead of treating every element of M equally, this paper attaches an attention module to learn a weight for every location of the feature map. Specifically, the similarity matrix M is fed into an adaptive average pooling layer that reduces its size while maintaining an overall view of the whole correspondence map, resulting in a set of weights of size 1 × HW. This is then fed into an excitation layer with a Linear-ReLU-Linear-Sigmoid structure to learn the weights. The weights are then multiplied with the similarity matrix, emphasizing or suppressing parts of the local correspondences to make model training more efficient. The basic structure of the matching scheme is shown in Fig. 2 and Fig. 3.
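The attention step can be sketched as a Squeeze-Excitation-style reweighting. This is a simplified NumPy sketch: the global average pool here stands in for the paper's adaptive average pooling, and W1/W2 are hypothetical learned excitation weights.

```python
import numpy as np

def attention_reweight(M, W1, W2):
    """Reweight a QAConv similarity volume M of shape (HW, H, W).

    Squeeze: average each (H, W) correspondence map to one value.
    Excite: Linear-ReLU-Linear-Sigmoid maps those values to weights
    in (0, 1). Scale: multiply M by its weights, emphasizing useful
    local correspondences and suppressing uninformative ones.
    """
    z = M.mean(axis=(1, 2))                    # squeeze: (HW,)
    h = np.maximum(W1 @ z, 0.0)                # first linear layer + ReLU
    w = 1.0 / (1.0 + np.exp(-(W2 @ h)))        # second linear layer + sigmoid
    return M * w[:, None, None]                # scale the similarity volume

# With zero excitation output, every weight is sigmoid(0) = 0.5.
M = np.ones((2, 2, 2))
out = attention_reweight(M, np.eye(2), np.zeros((2, 2)))
```

In training, W1 and W2 would be learned jointly with the rest of the matcher, so the weights adapt to which correspondence regions help discrimination.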

Loss Function
This paper implements a graph sampler to provide mini-batches for training, and QAConv with attention to effectively match image pairs. On top of that, a triplet loss function is needed to make use of those positive and negative pairs. With sufficient discrimination ability from the model and matcher, this paper adopts the batch OHEM triplet loss for model training [13]:

L(θ) = Σ_{i=1..P} Σ_{a=1..K} [ m − min_{p=1..K} s(f_θ(s^i_a), f_θ(d^i_p)) + max_{j≠i, n=1..K} s(f_θ(s^i_a), f_θ(d^j_n)) ]_+,

where S = {s^i_a, i = 1...P, a = 1...K} and D = {d^i_a, i = 1...P, a = 1...K} contain all the satellite and drone images from the P identities in a mini-batch, m is the margin, θ is the network parameter, f_θ is the backbone model, and s(·) is the matcher module.
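The batch-hard selection can be sketched in NumPy, assuming the matcher's pairwise similarities for a mini-batch have been collected into a matrix. Names and the sign convention (higher score = more similar) are illustrative.

```python
import numpy as np

def ohem_triplet_loss(sim, labels_s, labels_d, margin=16.0):
    """Batch OHEM (batch-hard) triplet loss on similarity scores.

    sim: (Ns, Nd) matcher similarities between satellite anchors and
    drone images; labels_s, labels_d: identity labels per image.
    For each anchor, keep the hardest positive (lowest same-id
    similarity) and hardest negative (highest different-id similarity).
    """
    pos = labels_s[:, None] == labels_d[None, :]      # same-identity mask
    hard_pos = np.where(pos, sim, np.inf).min(axis=1)
    hard_neg = np.where(pos, -np.inf, sim).max(axis=1)
    # Hinge: penalize anchors whose hardest negative comes within
    # `margin` of their hardest positive.
    return np.maximum(margin - hard_pos + hard_neg, 0.0).mean()

sim = np.array([[1.0, 3.0],
                [2.0, 1.0]])
loss = ohem_triplet_loss(sim, np.array([0, 1]), np.array([0, 1]), margin=2.0)
```

The second training phase simply swaps the roles of satellite and drone images as anchors.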

Implementation Details
The implementations in this paper are adapted from the official code of QAConv-GS [8]. Specifically, due to the cross-view matching nature of this task, this paper chooses a ResNet-50 with IBN-B layers to improve generalizability. The first three layers of the model are used to extract the feature map, following QAConv-GS. For the graph sampler, the batch size B is set to 64, with 32 satellite images and 32 drone images. The number of instances per identity, K, is set to 4. The margin of the OHEM triplet loss is set to 16. An SGD optimizer with a 0.005 learning rate is used for model training. The maximum number of training epochs is 60, with early stopping triggered if the loss decreases to 0.75 times its earlier value. The learning rate then decays by 0.1, and training continues for half of the previously trained epochs.

Dataset
Only one cross-view geo-localization dataset, University-1652, is used in this paper [1], since it is the only challenging large-scale dataset in this area. Results on other datasets such as CVUSA and CVACT are less convincing, as the latest SOTA works have already reached 96+ mAP on them. The University-1652 dataset includes satellite and drone-view images of 1,652 buildings from 72 universities around the world, with 701 buildings used for training and 951 used for testing. There is no overlap between the training and test data.

Comparison to the State of the Art
Table 1 shows the comparison of this method to the state of the art (SOTA) in cross-view geo-localization. The method implemented in this paper surpasses all of them by a large margin. As Table 1 shows, Rank-1 and mAP on University-1652 are improved by 3.42% and 3.92% for satellite→drone and by 4.30% and 3.71% for drone→satellite. Table 3 compares the performance of the QAConv method with and without attention layers. With attention layers, Rank-1 and mAP on University-1652 are improved by 0.58% and 2.66% for satellite→drone and by 2.32% and 1.90% for drone→satellite. This is mainly due to the adjustable weights on every location of the correspondence map: the model can dynamically raise or lower the importance of local correspondences and focus on the parts that benefit training the most, leading to faster convergence and better performance. Fig. 4 and Fig. 5 show the effects of different batch sizes and loss margins. Accuracy generally rises with larger batch sizes but plateaus at 64. As for loss margins, performance slightly improves with a bigger margin, which enforces stronger discrimination, and it peaks at margin = 16. Performance drops when the margin increases further, due to greater learning difficulty.

Conclusion
This study applies the recently introduced person re-identification method QAConv-GS to address the complex cross-view geo-localization task. In addition, attention layers are incorporated into the query-adaptive convolution layers to assign weights to the correlation map, enhancing or diminishing the significance of its various regions. This approach achieves state-of-the-art performance on the task, outperforming recent methods by a significant margin. This study provides empirical evidence of the significant potential of deep metric learning, particularly when paired with pairwise matching techniques, for cross-domain matching tasks. These methodologies may also find use in other domains, including but not limited to multimodal learning and expression recognition.
The efficacy of the methodology employed in this study demonstrates promising results for the cross-view geo-localization challenge. However, its performance has yet to be validated on real-world data. A recent study [17] observed that environmental variations, such as changes in weather conditions and illumination, can lead to the failure of learned models. Additionally, it is crucial to optimize the model's size and speed, as it is intended for deployment on unmanned aerial vehicles and satellites. In future work, we will delve more deeply into these matters.

Figure 1 .
Figure 1. A demonstration of how the graph sampler works. Different shapes indicate different classes, while different colors stand for different viewpoints. The colored figures are randomly chosen from neighboring classes to form a mini-batch.

Figure 2 .
Figure 2. This paper's implementation of QAConv on cross-view matching.

Figure 3 .
Figure 3. The structure of the added attention layers. A 'Squeeze-Excitation-Scale' process is carried out to apply weights to local parts of the correspondence map.

In Table 2, this paper compares two different sampling methods for metric learning: the graph sampler used in the experiments and the traditionally used PK sampler. The implementation of the PK sampler follows that of the graph sampler, with every mini-batch built around one identity. The results show that the PK sampler lacks the ability to find sufficiently informative data for training, due to its random sampling nature. In contrast, the graph sampler provides the model with the hardest samples throughout the whole training process, resulting in large performance improvements.

Table 3 .
Comparison of model performance with and without attention layers (%)