Towards Blur-Mixed Instance Search Combining Image-wise and Object-wise Features

With the development of intelligent robots and autonomous driving, images and videos with motion blur occur frequently, and such blurred images/videos may still need to be retrieved. Instance search has been studied for many years in the computer vision field, yet few works focus on instance search over blurred images. In this paper, we propose a framework for instance search on blurred image datasets built on three components: query blur mixing, image-wise feature based ranking, and instance-wise feature based re-ranking. Due to the lack of a benchmark, we also collect and build a blurred image dataset on which we conduct performance experiments. Experimental results show that our solution is promising for the instance search task.


Introduction
Intelligent robots and autonomous driving vehicles usually use cameras to collect videos and continuous images to support tasks such as environment recognition and navigation. Among computer vision tasks, object re-identification recognizes and labels the same object across frames. If object detection is used directly as the basis of object re-identification, an independent detection model must be trained for each type of object. Therefore, it is necessary to detect the objects in each frame of videos and continuous image streams in order to find a specific object and check whether it is the same object as in the query. As a basic task, Content-Based Image Retrieval (CBIR) directly retrieves the images similar to a query image. In contrast, the aim of the Instance Search (IS) task is to retrieve from a database the images that contain an instance of the query, e.g., provided as a bounding box within a query image, even when the object occupies only a small part of the image. Therefore, IS is useful for visual semantic analysis in multimedia and video for scenarios of environment recognition and navigation. In other words, instance search can be treated as a cross-domain of Object Detection (OD) and CBIR, because it not only considers the whole-image feature (like CBIR) but also needs to find the specific instance (like OD). Compared with CBIR, IS is a more advanced and more challenging task because of query instance diversity and response-timeliness requirements.
In real scenarios, due to the motion of cameras mounted on mobile intelligent vehicles and robots, there will be a large number of blurred frames in the collected videos and images, which deteriorates the performance of instance search. Motion blur not only decreases the visual quality of images but also significantly affects the performance of downstream visual tasks. At present, many algorithms have been proposed to address the degradation of object detection accuracy caused by blur [1]. Generally, these algorithms use image augmentation methods to generate datasets containing blurred images and then train a Convolutional Neural Network (CNN) for object detection. However, to our knowledge, there is no prior literature directly addressing instance search on blurred images.
In this paper, we study instance search with the aid of object re-identification from blurred videos or images. Firstly, we employ an object tracking algorithm to identify specific kinds of objects and to label them in the blurred images of the real dataset that serves as our benchmark. Secondly, we synthesize a blur-mixed query designed to adapt a pre-trained CNN for IS so as to fetch the frames/images that may contain the same instance as the query. To examine the effectiveness of our solution, we define blur degrees describing to what extent a video/image is blurred, so that experiments can be carried out on image queries with different blur degrees. We use ranking mean average precision as the evaluation measure, which is determined by the positions of the correctly retrieved items in the ranked result. The GOPRO data is used as the blurred video/image gallery for our instance search.
The main contributions can be summarized as follows: (i) We propose a framework for instance search on blurred images; to the best of our knowledge, it is the first such attempt, in contrast with related works targeting sharp objects. (ii) A blur-mixed query is designed to be combined with image-wise and instance-wise features for instance search on blurred images. (iii) We build a dataset in the style of TRECvid INS and conduct performance experiments on it.
The rest of this paper is organized as follows. Related work is discussed in Section 2. In Section 3, we propose a framework for instance search with blurred images and introduce its three main components. In Section 4, we discuss how to build the blurred image benchmark. In Section 5, we conduct extensive experiments on our proposed solution and discuss evaluation metrics. Section 6 concludes the paper and gives future research directions.

Related Work
In this paper, we address the motion blur problem in instance search. For this reason, we first review relevant works on instance search, and then review related works on the influence of motion blur in image retrieval.

Instance Search
Instance Search is often inaccurately treated as a conventional image search task like CBIR, i.e., the problem of searching for relevant images in an image gallery by analyzing visual content, given a query image. However, CBIR focuses only on the overall features and characteristics of the entire image, while IS emphasizes the features of the specified query instance, since its objective is to retrieve the images that contain that instance. Besides, the query instance is usually specified by a bounding box within an image or a video frame, while each gallery image may contain many other instances (objects other than the query instance). Therefore, in general, Instance Search is to seek, detect, and identify a certain instance within the whole scene of an image.
In recent research, most instance search approaches are based on pre-trained CNN models [2][3][4][5][6]. At first, researchers usually chose the output of the last fully connected layer as the feature vector of an image [7], a way to directly use global features in instance search. Later, researchers found that features from convolutional layers achieve better results [4]. The authors of [3] propose composing the sum of activations of each filter in a convolutional layer into a feature descriptor. Besides, based on the idea of R-CNN [2], some works extract region features of images and aggregate them. For example, the authors of [8] propose a compact descriptor named R-MAC that aggregates multiple region features from the feature map. The authors of [9] take a hybrid approach, extracting both image-based and region-based features from the CNN in a single forward pass. In later research, they propose using the Bag of Words (BoW) model to aggregate local CNN features for scalable instance search [10]. FCIS+XD [11] extracts instance-level features from the instance segmentation map of a fully convolutional network instead of pixel-wise features. Additionally, the authors of [12] explore retrieval of images in the medical field.
These models perform well on general instance search benchmarks. However, there is still no specific benchmark or dataset to evaluate their performance and robustness when faced with blurred images. As is known, even traditional image retrieval, which does not focus on instances, faces the same problem due to the lack of a blurred benchmark.

Motion Blur in Image Retrieval Tasks
In Object Detection and Visual Object Tracking (VOT), some works study how to improve robustness and maintain accuracy under corruptions [1][13], such as images or videos with noise and motion blur. It is equally important for IS to maintain precision and accuracy when searching for instances in such corrupted images. Since instance search is applied in scenarios similar to OD and VOT, improving the robustness of instance search under motion blur is also vital. In fact, there are few existing works on instance search in the blur scenario; this paper should be the first to discuss it.
Currently, the most authoritative benchmark for evaluating IS performance is TRECVID [14]. Since the advent of TRECVID, several other benchmarks or datasets derived from it have been proposed over the years. However, most of these datasets are not publicly available to researchers. Without access to these benchmarks, some classic image retrieval datasets can also serve as IS benchmarks; most of them are about landmarks, such as Oxford5k [15], Paris6k [16], and Holidays [16].
However, the applicable public datasets do not match the real scenarios mentioned before, as they contain few images with motion blur. Therefore, in Section 4 we propose a new benchmark for IS with motion blur, which is better in line with our topic.

Methodology
In this section, we present our methods for instance search on blurred images. We first give a framework in Figure 1, which contains three components: query processing for blurring, CNN-based image-wise features for a first filtering, and RPN-based instance-wise features for re-ranking.

Instance Search with Blurred Images
First, we give a unified definition of the task of instance search with blurred images. As shown in Figure 2, we assume that the query image containing the query instance is sharp, with no motion blur.

Mixing module
In other vision tasks, such as object detection or visual tracking, preprocessing images with a deblurring algorithm is a common way to reduce the influence of motion blur: deblurring algorithms are applied to make images sharper and clearer, and then conventional detection or tracking is performed on the deblurred images. However, in instance search or CBIR, it is not a good choice to deblur every single image in this way, especially in large-scale retrieval scenarios, because deblurring all images, including query and gallery images, before retrieval would incur a huge time overhead. This effectively gives up the requirement of real-time retrieval and would make such a method inapplicable in many online settings.
Inspired by image augmentation in deep learning, our idea is to add blur elements and features to the query image to make it more similar to the gallery images and move it closer to them in the feature space.
As Figure 1 shows, the framework of the instance search for blurred imaged mainly contains three modules: mixing, ranking and re-ranking.
In the first part, an instance search query is passed to the search system. Consider a given query image I of size W_I × H_I. It undergoes blur augmentation, which makes the query image blurry using simulated blur kernels of different blur degrees (shown in the leftmost red and blue dashed boxes), yielding n blurry images I_{b1}, I_{b2}, ..., I_{bn}. Then, two branches in our framework mix the original image I with the blurry images I_b, using different mixing strategies: 1) pixel-level mixing, shown in red above; 2) feature-level mixing, shown in blue below.
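The blur augmentation step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are ours, and we assume a simple linear motion-blur kernel drawn as a line through the kernel center at a given angle.

```python
import numpy as np
from scipy.ndimage import convolve

def linear_motion_kernel(size, angle_deg):
    """Build a normalized linear motion-blur kernel of the given size and angle."""
    kernel = np.zeros((size, size), dtype=np.float64)
    c = (size - 1) / 2.0
    theta = np.deg2rad(angle_deg)
    # Rasterize a line through the kernel center along the motion direction.
    for t in np.linspace(-c, c, size * 4):
        x = int(round(c + t * np.cos(theta)))
        y = int(round(c + t * np.sin(theta)))
        if 0 <= x < size and 0 <= y < size:
            kernel[y, x] = 1.0
    return kernel / kernel.sum()

def blur_variants(image, kernel_sizes=(10, 30, 60), angle_deg=0.0):
    """Return the n blurred copies I_b1..I_bn of a query image (H x W x 3)."""
    variants = []
    for k in kernel_sizes:
        kernel = linear_motion_kernel(k, angle_deg)
        # Convolve each color channel with the motion-blur kernel.
        blurred = np.stack(
            [convolve(image[..., ch], kernel, mode="nearest")
             for ch in range(image.shape[-1])], axis=-1)
        variants.append(blurred)
    return variants
```

Larger kernel sizes correspond to larger blur degrees K, matching the light/medium/heavy settings used later in the experiments.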
Pixel-level mixing. We directly add up the query image and its blurred images per pixel in RGB values and take a weighted average. The query image has size W_I × H_I, and the RGB value of the pixel at position (x, y) in I is V^I_{x,y}. We compute the weighted average of the RGB values of I and all I_b pixel by pixel to obtain the new image I_M with RGB values V^M. Formula (1) shows the process:

V^M_{x,y} = w_0 V^I_{x,y} + Σ_{i=1}^{n} w_i V^{b_i}_{x,y},  with Σ_{i=0}^{n} w_i = 1.   (1)

We treat the mixed image I_M as a single new query image, input to a typical deep convolutional network such as ResNet50 [17] or VGG16 [18]. The features expressing the global information are obtained from the activations of the bottleneck network blocks.
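The pixel-level weighted average can be sketched as below; a minimal illustration with an illustrative function name, defaulting to a plain average over the n+1 images when no weights are given.

```python
import numpy as np

def pixel_level_mix(query, blurred_variants, weights=None):
    """Average the query image I and its blurred copies I_b per pixel.

    `weights` gives one weight per image (query first), summing to 1;
    the default is a plain average over the n+1 images.
    """
    images = [np.asarray(query, dtype=np.float64)] + \
             [np.asarray(b, dtype=np.float64) for b in blurred_variants]
    if weights is None:
        weights = np.full(len(images), 1.0 / len(images))
    weights = np.asarray(weights, dtype=np.float64)
    assert np.isclose(weights.sum(), 1.0), "weights must sum to 1"
    # Weighted per-pixel average over query + blurred variants.
    return sum(w * img for w, img in zip(weights, images))
```

The resulting mixed image I_M is then fed to the backbone CNN as an ordinary query image.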
Feature-level mixing. We first extract the CNN features of the query image I and its blurred images I_b. After extraction, we obtain the d-dimensional feature vector F_I of the query image and those of the blurry images F_{b_1}, F_{b_2}, ..., F_{b_n}. We then mix the query feature vector and the blurry feature vectors with one of three aggregation strategies, summing, averaging, or maximizing, to obtain the new mixed feature vector F_M. Let f^z_y denote the value of dimension z of a feature vector F_y. The formulas below express the three strategies at the level of f, respectively:

sum: f^z_M = f^z_I + Σ_{i=1}^{n} f^z_{b_i};
avg: f^z_M = (f^z_I + Σ_{i=1}^{n} f^z_{b_i}) / (n + 1);
max: f^z_M = max(f^z_I, f^z_{b_1}, ..., f^z_{b_n}).
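The three aggregation strategies can be sketched compactly; a minimal illustration with an illustrative function name, operating on plain NumPy feature vectors.

```python
import numpy as np

def feature_level_mix(query_feat, blurred_feats, strategy="max"):
    """Aggregate the query feature F_I with blurred features F_b1..F_bn.

    strategy: 'sum', 'avg', or 'max', applied independently per dimension z.
    """
    stacked = np.vstack([query_feat] + list(blurred_feats))  # shape (n+1, d)
    if strategy == "sum":
        return stacked.sum(axis=0)
    if strategy == "avg":
        return stacked.mean(axis=0)
    if strategy == "max":
        return stacked.max(axis=0)
    raise ValueError(f"unknown strategy: {strategy}")
```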
Comparing the strategies above in preliminary tests, we find that the max strategy gives the best results. Finally, we treat the new mixed feature as the actual query feature used afterwards.

Ranking module
The second part is similar to the image representation in [9]. We use cosine distance to compute the similarity between features. The result of the mixing stage is passed to the ranking stage as an image-wise feature, which enables a first filtering of the whole image gallery (see Figure 1) to produce a coarse ranking of the results most relevant to the query. Only the top-ranked images are passed to the subsequent re-ranking process. In practice, we apply standard retrieval tricks to strengthen performance, such as query expansion (QE) before the re-ranking stage: the image features of the top M results of the ranking stage are averaged together with the query image feature to perform a new search.
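The coarse ranking and the QE step can be sketched as follows; a minimal illustration assuming L2-normalizable feature vectors, with illustrative function names.

```python
import numpy as np

def cosine_rank(query_feat, gallery_feats):
    """Rank gallery images by cosine similarity to the query feature."""
    q = query_feat / np.linalg.norm(query_feat)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sims = g @ q                       # cosine similarity per gallery image
    order = np.argsort(-sims)          # descending similarity
    return order, sims[order]

def query_expansion(query_feat, gallery_feats, order, top_m=5):
    """Average the query feature with the top-M ranked features (QE)."""
    top = gallery_feats[order[:top_m]]
    return np.mean(np.vstack([query_feat[None, :], top]), axis=0)
```

The expanded feature from `query_expansion` is then used to run a new search before re-ranking.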

Re-ranking module
In the re-ranking process, Region Proposal Network (RPN) features are used to describe the instance-wise information. The RPN has received wide attention in object detection and other computer vision tasks. Our method uses the RPN from an off-the-shelf pre-trained Faster R-CNN model. The RPN provides local and subtle cues that reveal the instance-wise characteristics and the location of the instance in the image containing it. We obtain n candidate regions where the instance may exist, and after ROI pooling over the proposal regions and the convolutional feature map, we obtain the feature vectors R_1, R_2, ..., R_n of these regions. We then aggregate the local features of these regions R with the global feature F to get the final feature E for re-ranking:

E = F ⊕ R_1 ⊕ R_2 ⊕ ... ⊕ R_n,

where ⊕ can be any aggregation method, such as maximum, sum, or average. Thanks to the previous filtering, only a few images are re-ranked, so the amount of computation is greatly reduced. Besides, the RPN also predicts the position of the query instance in the retrieved images.
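The re-ranking aggregation can be sketched as below; a minimal illustration that assumes the global feature and ROI-pooled region features are already extracted, uses max as the ⊕ operator, and scores candidates by cosine similarity to the query feature.

```python
import numpy as np

def rerank(query_feat, candidates):
    """Re-rank top candidates using global + region features.

    `candidates` is a list of (global_feat, region_feats) pairs, where
    region_feats are the ROI-pooled features of the RPN proposals.
    Each candidate's final feature E is the per-dimension max over the
    global feature F and its region features R_1..R_n.
    """
    q = query_feat / np.linalg.norm(query_feat)
    scores = []
    for global_feat, region_feats in candidates:
        e = np.vstack([global_feat] + list(region_feats)).max(axis=0)
        e = e / np.linalg.norm(e)
        scores.append(float(e @ q))    # cosine similarity of E to query
    order = np.argsort(-np.asarray(scores))
    return order, scores
```

Because only the short list surviving the ranking stage enters `rerank`, the per-region work stays cheap.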
Among the three modules, the first can be implemented by balancing the weights of the blurred query instances. The other two are implemented with networks like Faster R-CNN: the image-wise and instance-wise features are easily obtained from the ResNet backbone and from the ROI pooling layer applied to the instance boxes suggested by the RPN, respectively.

Dataset Construction
As mentioned in Section 2.2, there is still no dataset that properly matches the physical world and real scenarios of instance search. Moreover, existing instance search datasets focus on sharp images and contain few blurry ones. Thus, inspired by COCO-C [19] and the Blurred Video Tracking (BVT) benchmark [13], we propose an instance search benchmark specifically for assessing task robustness on images with motion blur, named Blur Instance Search (Blur-INS). Blur-INS covers both the image and video scenarios above. In this section, we elaborate on how the Blur-INS dataset is constructed, including data collection, data distribution, and the evaluation of robustness w.r.t. the degree of motion blur.
Among existing IS datasets, TRECvid INS [14] is a subset of the TREC Video Retrieval Evaluation [20] and contains more than 20,000 key frames. Other classic image retrieval datasets, such as Oxford5k [15] and Paris6k [16], usually define the queries first and then collect the image gallery, so the target objects appear clearly and prominently in the images. TRECvid INS instead collects the data first and then chooses specific objects or persons, which may not be the main part of the images; it is therefore not only closer to real-world applications but also more challenging.
However, TRECvid is only open to its participants, and other datasets do not contain enough suitable images to assess the motion blur robustness of instance search. For these reasons, we borrow 33 sequential image sets with their simulated motion blurs from the deblurring dataset GOPRO [21]. We also apply image augmentation techniques to produce synthetic motion blur with different blur kernels and random blur angles. We generate ground-truth bounding boxes on the sharp images with an SOT algorithm [22] and directly take these annotations as the bounding boxes of the corresponding blurred images. For each image sequence, we take an instance of the first frame as the query and use the other frames to create the image gallery, while only the frames in which the instance appears are taken as true positives of the query. Figure 3 shows the distribution of the true positives in Blur-INS.

Experiment
We conduct ablation experiments to confirm that motion blur indeed affects the result of instance search and to validate the technical correctness and feasibility of our method. To verify our solution, we run experiments under different settings. We classify the data into two classes: images with light blur and images with heavy blur. We use a blur degree judgement algorithm [23] which exploits blur kernels to adjust or measure the blur degree of an image; the blur degree is mainly determined by the size of the blur kernel. We parameterize the blur degree by the kernel size K, which is larger when the image is more blurred, and set light blur as K = 10, medium as K = 30, and heavy as K = 60. Image samples with different blur degrees are shown in Figure 4; from left to right, the three images have K = 10, 30, and 60, respectively.

Results and Analysis
We also add many confusing counterexamples and distractor images to the dataset for a better evaluation of robustness. These share some features with the query, e.g., the same backgrounds or scenarios. Figure 5 shows a confusing image example: the rightmost image has the same scenario as the others but does not contain the target instance.
The basic statistics of the dataset are given in Figure 3. We use the mean average precision (mAP) metric to evaluate the performance of the retrieval task and summarize the results in Table 1. For realism and fairness, the method used to generate the blurred database images differs from that used in the mixing stage. Since our main goal in the mixing stage is to add motion blur features to the image, we use a regular linear blur generator with different kernel sizes and angles; for the gallery images, where we want to simulate reality as closely as possible, we adopt a non-linear generator with different kernel sizes, angles, and intensities, which aims to simulate the shake of a real-world camera. The image gallery is divided into a concise set (without confusing images) and a confusing set, and each set is further divided into parts of different blur degrees. For each division, we try different mixing or QE strategies and report the average mAP values in Table 1. Results on the concise set and the lighter blur degrees are better than the others, which proves that motion blur does influence the performance of instance search, and not slightly. It also indicates that both of our mixing strategies can benefit the search results. The best mAP value is shown in bold. The results show that the higher the difficulty, the weaker or even ineffective QE becomes, e.g., in the heavy-concise and medium-confusing sections. When the difficulty is high, the pixel-level mixing strategy is far inferior to the feature-level one, and its performance on medium-confusing is even worse than the sharp query without any processing. We attribute this to the pixel-level blur acting like noise: too much of it brings negative effects on the results. When the task is less difficult, the two mixing strategies show almost the same effectiveness.
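The mAP metric used above can be computed as follows; a minimal sketch where each query's ranking is reduced to a 0/1 relevance list over the retrieved order.

```python
import numpy as np

def average_precision(ranked_relevance):
    """AP for one query: ranked_relevance is a 0/1 list over the ranking."""
    rel = np.asarray(ranked_relevance, dtype=np.float64)
    if rel.sum() == 0:
        return 0.0
    cum_hits = np.cumsum(rel)
    # Precision at each rank position k (1-indexed).
    precision_at_k = cum_hits / (np.arange(len(rel)) + 1)
    # Average the precision values at the positions of the relevant items.
    return float((precision_at_k * rel).sum() / rel.sum())

def mean_average_precision(all_rankings):
    """mAP over all queries' relevance lists."""
    return float(np.mean([average_precision(r) for r in all_rankings]))
```

For example, a ranking with relevance [1, 0, 1] has AP = (1/1 + 2/3) / 2 = 5/6.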

Conclusion
In this paper, we propose an instance search framework combining image-wise and region-wise features, together with a blurred image set named Blur-INS. We first rank gallery images roughly according to global feature vectors, then re-rank the candidate items according to their region features. In addition, we adopt two mixing strategies, pixel-level and feature-level, to merge sharp and blurred images and enhance the query feature. Experimental results show that within a certain blur degree, image fusion does improve performance by about 4% to 7%, while heavy blur indeed has a bad impact. In the future, we plan to modify or even replace the RPN part to improve the local feature representation and augment the re-ranking results.

Figure 1: The framework of instance search for blurred images

Figure 3: Distribution of Blur-INS. The X-axis represents the interval of the number of positive samples a query has in Blur-INS, and the Y-axis represents the number of queries in each interval.

Table 1: Performance (mAP) of instance search with different difficulties (concise or confusing set) and blur degrees on the Blur-INS dataset. The strategy used for feature mixing is max.