Weakly supervised object detection based on deep metric learning

Weakly supervised object detection is an active research topic in computer vision that aims to train a high-performance detection model from low-cost annotation data. Existing weakly supervised object detection methods only summarize object categories and do not consider the similarity between objects during optimization. To address this problem and improve detection accuracy, this paper proposes a weakly supervised object detection model based on deep metric learning. In the initial training phase, an initial metric is learned in advance to measure the similarity between objects. In the correction phase, we propose an adjacent instance mining method based on proxy samples, which expands the model’s recognition view and prevents it from prematurely locking onto the wrong object position. We design a series of experiments on the PASCAL VOC 2007 dataset to verify the effectiveness of this method.


Introduction
Object detection is a difficult problem in computer vision. It uses detection methods to identify whether an image contains a predefined target object and to obtain the position of that object in the image. It is widely used in tasks such as object tracking [1], behaviour understanding [2], and instance segmentation [3]. At present, object detection usually adopts strongly supervised learning, which requires bounding-box labels for a large number of image targets and is therefore very expensive. In response, some researchers have proposed weakly supervised object detection (WSOD) methods, which obtain good detection results from low-cost label information: they only need image-level labels to determine the object location. However, coarse image-level labels alone cannot fully express the information of the object. Moreover, most existing methods use multiple instance learning (MIL), which only summarizes different category concepts and fails to optimize feature distances according to the similarity of samples, resulting in poor classification performance.
To address these problems of existing WSOD methods, this paper proposes a deep metric object detection model (DMOD). The model is divided into two phases, initialization and correction. In the initial phase, the model uses a small amount of image content in the ground truth bounding boxes to learn an initial measurement between samples, which optimizes the feature distribution according to the visual similarity between different objects. In the correction phase, this paper proposes a correction learning method based on neighbor clusters, which aims to mine the useful information around the central object and prevents the model from prematurely locking onto the wrong position. Experiments on the PASCAL VOC 2007 dataset are designed to verify our method.

2. Related work
In recent years, WSOD has received widespread attention. Some studies tend to directly train an end-to-end object detection model [4][5][6][7][8][9]. They integrate the MIL method into network training, apply only image-level labels to train a basic classifier, and then assign new supervision information to the object region to optimize it gradually. P. Tang et al. [8] proposed a multiple-stream detector optimization method for the WSOD model. R. G. Cinbis et al. [5] proposed iteratively refining the detector and continuously correcting the position of the object during training. C. Lin et al. [7] proposed a MIL-based object instance mining (OIM) framework, which tries to detect all possible object instances in each image by introducing information propagation on spatial and appearance graphs. This kind of learning approach treats the whole image as a "bag" of instances and simply labels the image as a "positive bag" or a "negative bag" according to whether it contains instances of a certain category, without considering the similarity between the instances in the bag. In contrast, our method uses a small amount of image content in the ground truth bounding boxes to learn the distance measurement between instances in advance, so as to reduce the noise of the supervision signal during model optimization.

3. DMOD
Our model is divided into two phases: the initial phase and the correction phase. During the initial phase, an initial metric model is obtained from an instance-level image set. During the correction phase, multiple regions that are spatially close to the central object are obtained and used as positive samples in the optimization of the initial DMOD model. As shown in figure 1, the model evolves iteratively and tends to be stable after k rounds of correction learning.
Figure 1. Algorithm process. A small instance-level dataset is extracted from the data to train the initial DMOD model. The adjacent instances of the target are then mined according to the image-level label information.
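The idea of collecting regions that lie spatially close to a central object can be illustrated with a minimal sketch. It assumes that "close" is measured by intersection over union (IoU) with the highest-scoring proposal and that a threshold of 0.5 is used; neither choice is stated in the text, and the names `iou` and `mine_adjacent` are hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mine_adjacent(
    proposals: Sequence[Box],
    scores: Sequence[float],
    iou_threshold: float = 0.5,  # assumed value, not taken from the paper
) -> Tuple[int, List[int]]:
    """Pick the top-scoring proposal as the central object and return the
    indices of all other proposals that overlap it strongly enough."""
    center = max(range(len(proposals)), key=lambda i: scores[i])
    neighbors = [
        i
        for i, box in enumerate(proposals)
        if i != center and iou(box, proposals[center]) >= iou_threshold
    ]
    return center, neighbors


proposals = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.4, 0.3]
center, neighbors = mine_adjacent(proposals, scores)
```

In this sketch the central proposal and its neighbors would together serve as positive samples for the next round of correction learning, while distant proposals are left out.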

Initialization metric
In this paper, the metric model structure shown in figure 2 is designed. We adopt a ResNet pre-trained on a large dataset as the M-P and M-G subnetworks of DMOD, and the two subnetworks share parameters during training. Behind each subnetwork, a nonlinear transformation module maps the distributed features learned by the subnetwork to the metric space; the parameters of these modules are not shared. During training, the input of the model is a sample pair and the output is the similarity $s$ between the two samples. An instance-level image set $GT$ is obtained from a small amount of image content in the ground truth bounding boxes, and its label set is $L = \{L_1, L_2, \dots, L_N\}$, where $N$ is the number of classes.
Each instance sample satisfies $gt \in GT$, and its label satisfies $l \in L$. During training, the two images of a positive sample pair both belong to the same class $L_u$, and the similarity label $y = 1$ is assigned to the pair as a positive sample of metric learning; the positive sample set consists of all such pairs. Conversely, the two images of a negative sample pair belong to different classes $L_u$ and $L_v$, and the similarity label $y = 0$ is assigned to the pair as a negative sample of metric learning; the negative sample set consists of all such pairs. The DMOD generates fixed-size feature vectors $F_i$ and $F_j$ for $gt_i$ and $gt_j$ and outputs the similarity $s_{ij}$ of the sample pair. To enlarge the distance between different classes and reduce the distance within the same class at the same time, we train the network with a contrastive loss $L_0$, which is the loss function of the DMOD model in the initialization phase; $A$ is the number of sample pairs in the training set, and $e$ is the distance margin.
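The pair construction and the contrastive objective can be illustrated with a minimal sketch. It assumes the widely used form of the contrastive loss, computed on the Euclidean distance between the two feature vectors and averaged over the pairs with a factor of one half; the exact formula used by DMOD, which is expressed through the similarity of the pair, may differ. The function names are hypothetical.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def build_pairs(labels: Sequence[str]) -> List[Tuple[int, int, int]]:
    """Return (i, j, y) for every unordered pair of instance samples,
    with y = 1 when both samples share a class and y = 0 otherwise."""
    return [
        (i, j, 1 if labels[i] == labels[j] else 0)
        for i, j in combinations(range(len(labels)), 2)
    ]


def euclidean(f_i: Sequence[float], f_j: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f_i, f_j)))


def contrastive_loss(
    distances: Sequence[float], ys: Sequence[int], margin: float
) -> float:
    """Standard contrastive loss over A pairs (assumed form):
    (1 / 2A) * sum( y * d^2 + (1 - y) * max(margin - d, 0)^2 )."""
    total = 0.0
    for d, y in zip(distances, ys):
        total += y * d ** 2 + (1 - y) * max(margin - d, 0.0) ** 2
    return total / (2 * len(distances))


# Toy features for three instance samples, two of the same class.
labels = ["cat", "cat", "dog"]
features = [[0.0, 0.0], [0.0, 1.0], [3.0, 4.0]]
pairs = build_pairs(labels)
distances = [euclidean(features[i], features[j]) for i, j, _ in pairs]
loss = contrastive_loss(distances, [y for _, _, y in pairs], margin=1.0)
```

Positive pairs are penalized by their squared distance, while negative pairs contribute only when they fall inside the margin, which is how the loss pulls samples of the same class together and pushes different classes apart.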

Correction learning
Firstly, $R$ proposals are generated for the input image by the selective search [10] method, giving the proposal set $\{b_1, \dots, b_R\}$. To facilitate the calculation of the loss function, each entry $\hat{y}^k_n$ of the label vector is one-hot encoded, where $r$ is the serial number of the proposal and $1(\cdot)$ is the indicator function. In correction learning, when $k > 0$, the model generates an $(N+1)$-dimensional similarity vector for each proposal, and the value of the $(N+1)$-th dimension is the mean similarity between the proposal and all regional elements in the background cluster. In this paper, softmax normalization is applied to the similarity scores along the $r$ and $n$ directions, giving $x^k_r$ and $x^k_n$; the normalized score represents the contribution of $b_r$ to the positioning of the class $n$ object. The final normalized score matrix is the Hadamard product [11] of $x^k_r$ and $x^k_n$, $S^k = x^k_n \odot x^k_r$. The loss function of the $k$-th correction learning is defined on this matrix, where $S^k_{ri}$ represents the similarity score after normalization.
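The score normalization can be illustrated with a minimal sketch: a background column holding the mean similarity to the background cluster is appended, softmax is applied once over the class direction and once over the proposal direction, and the two results are multiplied element-wise. The matrix layout (rows are proposals, columns are classes) and the function names are assumptions made for illustration, and the loss of the correction step is not reproduced here.

```python
import math
from typing import List, Sequence


def softmax(values: Sequence[float]) -> List[float]:
    """Numerically stable softmax of a 1-D sequence."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def with_background(
    x: Sequence[Sequence[float]], background_sims: Sequence[Sequence[float]]
) -> List[List[float]]:
    """Append an (N+1)-th entry to each proposal's similarity vector: the mean
    similarity between that proposal and all background-cluster elements."""
    return [
        list(row) + [sum(bg) / len(bg)]
        for row, bg in zip(x, background_sims)
    ]


def normalized_scores(x: Sequence[Sequence[float]]) -> List[List[float]]:
    """Softmax over classes and over proposals, combined by Hadamard product."""
    n_rows, n_cols = len(x), len(x[0])
    # Class direction: each proposal's scores sum to 1 over the classes.
    x_n = [softmax(row) for row in x]
    # Proposal direction: each class's scores sum to 1 over the proposals.
    cols = [softmax([x[r][c] for r in range(n_rows)]) for c in range(n_cols)]
    x_r = [[cols[c][r] for c in range(n_cols)] for r in range(n_rows)]
    # Element-wise (Hadamard) product of the two normalized matrices.
    return [
        [x_n[r][c] * x_r[r][c] for c in range(n_cols)] for r in range(n_rows)
    ]


# Two proposals, two classes, plus a background column.
sims = with_background([[2.0, 0.5], [0.1, 1.5]], [[0.2, 0.4], [0.3, 0.3]])
scores = normalized_scores(sims)
```

An entry of the resulting matrix is large only when the proposal both favors that class over the others and stands out among the proposals for that class.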

Experimental results and analysis
We evaluate our model on the PASCAL VOC 2007 dataset, which contains 20 object categories; the trainval set contains 5011 images and the test set contains 4952 images. We train the model on the trainval set and evaluate it on the test set. Mean average precision (mAP) and correct localization (CorLoc) are used as evaluation metrics.
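As an illustration of the second metric, CorLoc is commonly computed as the fraction of images in which the top-scoring box for a class overlaps a ground truth box of that class with an IoU of at least 0.5. The sketch below follows that common convention for a single class; the function names and the toy boxes are hypothetical, and the evaluation code behind the reported numbers may differ in detail.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def corloc(
    top_boxes: Sequence[Box],
    gt_boxes: Sequence[Sequence[Box]],
    iou_threshold: float = 0.5,  # common PASCAL VOC convention
) -> float:
    """Fraction of images whose top-scoring box matches any ground truth box.

    top_boxes[i] is the highest-scoring predicted box in image i, and
    gt_boxes[i] lists the ground truth boxes of the same class in image i.
    """
    hits = sum(
        1
        for pred, gts in zip(top_boxes, gt_boxes)
        if any(iou(pred, g) >= iou_threshold for g in gts)
    )
    return hits / len(top_boxes)


# One image localized correctly, one missed.
value = corloc(
    [(0, 0, 10, 10), (0, 0, 10, 10)],
    [[(1, 1, 11, 11)], [(50, 50, 60, 60)]],
)
```

Averaging this per-class value over the 20 categories gives a single CorLoc figure for the dataset.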
The experiment compares the influence of the VGG19 and ResNet50 backbones on the detection results. As shown in figure 4, once the network becomes stable in the correction process, the mAP of the ResNet50 framework is 7.4% higher than that of the VGG19 framework. Applying the same data augmentation (DA) to the training data improves the performance of both frameworks. The highest accuracy, 49.1%, is obtained with the ResNet50 structure and data augmentation. As shown in figure 5, the mAP of our method is 0.8 percentage points higher than that of SDCN. In addition, table 1 shows that the CorLoc of our method reaches 67.1%, which is higher than that of other traditional methods. Compared with some traditional weakly supervised methods, our performance is improved to some extent.

Conclusion
In this paper, a DMOD framework based on deep metric learning is proposed to address the poor classification performance of traditional methods. Unlike the MIL method, deep metric learning fits a measurement function from the perspective of sample distance, which makes the distribution of similar samples more compact and heterogeneous samples more distant in the metric space, thus greatly reducing the difficulty of classification. The experiments show that our method is effective in weakly supervised object detection.