DIM: long-tailed object detection and instance segmentation via dynamic instance memory

Object detection and instance segmentation have been successful on benchmarks with relatively balanced category distributions (e.g. MS-COCO). However, state-of-the-art object detection and segmentation methods still struggle to generalize on long-tailed datasets (e.g. LVIS), where a few classes (head classes) dominate the instance samples, while most classes (tailed classes) have only a few samples. To address this challenge, we propose a plug-and-play module within the Mask R-CNN framework called dynamic instance memory (DIM). Specifically, we augment Mask R-CNN with an auxiliary branch for training. It maintains a dynamic memory bank storing an instance-level prototype representation for each category, and shares the classifier with the existing instance branch. With a simple metric loss, the representations in DIM are dynamically updated by the instance proposals in the mini-batch during training. Together with a class frequency reversed sampler, DIM introduces a bias toward tailed classes into classifier learning, complementing the existing instance branch, which learns generalizable representations from the original data distribution. Comprehensive experiments on LVIS demonstrate the effectiveness of DIM and its significant advantages over the baseline Mask R-CNN.


Introduction
Object detection and instance segmentation are fundamental and essential tasks in computer vision. They play a central role in many downstream tasks involving instance-level understanding, e.g. person search [1][2][3], visual reasoning [4,5], and human pose estimation [6][7][8]. Over the past few years, research on object detection and instance segmentation has witnessed remarkable progress, yielding flexible frameworks such as Faster region-based convolutional neural networks (R-CNN) [9] and Mask R-CNN [10], and excellent performance on public benchmarks such as PASCAL-VOC [11] and MS-COCO [12].
While MS-COCO is widely used for benchmarking general object detection algorithms, there still exists a significant gap between the category distribution of MS-COCO and the real world. MS-COCO contains 80 categories of frequently seen objects, and most categories have a large number of annotated instances. However, the category distribution in the real world is long-tailed [13], where only a few head classes contain abundant instances, while a considerable number of tailed classes have only a few instances each. To enable this kind of research, a large-scale dataset, LVIS [14], has recently been collected, which contains more than 1000 object categories exhibiting a long-tailed distribution. As investigated in [14], plain Mask R-CNN [10] performs poorly on the low-shot detection task: training with very few examples yields very poor generalization performance.
In the literature, many approaches have been proposed to boost the accuracy of long-tailed image classification. Among these methods, class re-balancing strategies [15,16] have developed into a prominent family; they seek to alleviate extreme class imbalance by re-sampling the examples or re-weighting their losses within the mini-batch during training. However, as pointed out in [17], while these re-balancing methods promote classifier learning on tailed classes, they damage the generalization ability of the learned features. A similar phenomenon is observed for long-tailed object detection, as the repeat factor sampling [14] strategy may degrade performance on head classes. In [17], the authors propose a bilateral branch network: one branch is trained with the original data distribution and is responsible for learning universal patterns, while another re-balancing branch is designed to model the tail data.
Recently, several approaches [18][19][20] have been proposed to tackle the challenging long-tailed object detection and instance segmentation task. Most existing works extend the re-balancing ideas from image classification to better accommodate the detection framework. Equalization Loss (EQL) [18] balances the gradients to alleviate the impact of discouraging gradients on tail categories. Balanced Group Softmax (BAGS) [19] balances the classifiers within the detection framework through group-wise training. These methods typically affect the classifier learning and representation learning of Mask R-CNN simultaneously, and risk hurting the generalization ability of the learned representations.
In this paper, we tackle the challenging long-tailed object detection and instance segmentation task, by designing a novel auxiliary branch during training for the Mask R-CNN [10] framework. Our auxiliary branch will not affect the representation learning driven by the existing instance branch, but introduces a bias to the instance classifier toward tailed classes on its own branch.
Specifically, in parallel with the existing instance (box and mask) branch, our auxiliary branch maintains a dynamic memory bank that stores an instance-level prototype representation for each category, and shares the instance classifier with the existing instance branch. The instance representation for each category is dynamically updated during training, by using a metric loss to pull it closer to the region of interest (RoI) features of the same category in a mini-batch, and push it away from the RoI features of different categories. With a class frequency reversed sampler, our instance memory bank enables biased classifier learning towards tailed classes, which is complementary to the existing instance branch that learns from the original data distribution for better representations. We term our method dynamic instance memory (DIM), and conduct comprehensive experiments on LVIS to demonstrate its effectiveness. DIM achieves consistent improvements over the baseline Mask R-CNN. We also give ablation studies to verify the design choices of our DIM.

Key terms involved in this work
First, we briefly describe the key terms (tasks) involved in this work, and discuss the differences and relationships between them.
• Object detection aims to predict the location (represented by a bounding box) and semantic label of each instance of interest in an image.
• Instance segmentation goes one step further than object detection, by predicting a mask for each instance along with the bounding box.
• Long-tailed recognition has emerged as a family of hot research topics, as instances in the real world exhibit a long-tailed distribution. These topics include long-tailed image classification, object detection and instance segmentation, etc.

Object detection
Driven by powerful convolutional neural networks (CNNs), object detection has improved rapidly over the past few years. Previous object detection methods can be roughly divided into two categories, i.e. the two-stage pipeline and the one-stage pipeline. The two-stage pipeline first generates a set of sparse region proposals, and then utilizes CNNs to predict the category and localization of each proposal. The seminal works of two-stage detection are the R-CNN series [9,21,22]. Specifically, R-CNN [21] leverages selective search [23] to extract region proposals, and applies a convolutional network to obtain the category and localization of each region independently. Fast R-CNN [22] improves R-CNN by sharing convolutional features among RoIs, which reduces redundant computation and increases speed. Faster R-CNN [9] then advances region proposal generation with a region proposal network (RPN), further improving detection performance and speed. Faster R-CNN is a popular two-stage detection method, and is the foundation of many follow-up works [24,25]. Different from the two-stage pipeline, the one-stage pipeline directly regresses and classifies a set of regular pre-defined anchor boxes to obtain the detection results [26,27]. For example, YOLO [26] achieves real-time object detection by forwarding the input image once through an efficient backbone network.
RetinaNet [27] proposes the focal loss to balance positive and negative samples, and outperforms two-stage methods for the first time. MSSIF-Net [28] utilizes adaptive spatial feature fusion and a multi-scale channel attention mechanism to improve detection accuracy.
Both traditional two-stage and one-stage methods require pre-defined anchor boxes. Recently, anchor-free detectors [29,30] have emerged as a promising detection pipeline, by replacing traditional anchor boxes with anchor points and developing new label assignment strategies for object proposals. For example, FCOS [30] regresses, at each location, the distances from that location to the sides of the object bounding box. ATSS [31] first points out that the essential difference between anchor-based and anchor-free detection lies in how positive and negative training samples are defined, and proposes a novel sample selection strategy to bridge the gap between anchor-based and anchor-free detection.

Instance segmentation
Instance segmentation aims to estimate a high-quality mask for each instance in an image, and has aroused more and more interest in recent years. In many early attempts [32,33], semantic segmentation precedes instance recognition, which is often slow and less accurate, because multi-class semantic segmentation is itself a very challenging task. For example, IFCN [33] extends fully convolutional networks [34] to the instance segmentation task. It computes a set of instance-sensitive score maps, each of which is the outcome of a pixel-wise classifier of a relative position to instances. Mask R-CNN [10] provides a simple yet flexible instance segmentation framework, where multi-class object detection and binary instance mask estimation are performed simultaneously. More recently, SOLO [35] demonstrated that a simple one-stage pipeline without predicting object bounding boxes is also promising for instance segmentation, by carefully considering the geometric structure (or location and size) of possible instances at every spatial location. APP-UNet16 [36] proposes a two-stage attention-aware method to detect defect areas in a coarse-to-fine manner: the localization stage first produces coarse locations, and the segmentation stage then segments the detected areas to obtain exact results. In this work, we treat the widely used Mask R-CNN system as our baseline, and adapt it to the scenario of long-tailed instance distributions.

Long-tailed recognition
Long-tailed recognition has received increasing attention, because its setting is closely aligned with the real world. In the past few years, the long-tailed image recognition task has been extensively studied, and several principles have been established. In early research, class re-balancing strategies (e.g. re-sampling [37,38] and re-weighting [39,40]) were the prominent methods to alleviate the long-tailed distribution issue during learning. For example, Relay Backpropagation [37] simply repeats data for minority classes to balance the categories. CB Focal [40] utilizes the effective number of samples for each class to re-balance the loss. However, BBN [17] demonstrated that these re-balancing methods can significantly promote classifier learning while, to some extent, damaging the representative power of the learned CNN features. Hence, the authors advocate decoupling classifier and representation learning to preserve the representative power of CNNs.
Besides image classification, long-tailed object detection and instance segmentation have recently received more and more attention. Unfortunately, existing long-tailed classification methods are not well applicable to detection frameworks, due to the difficulty of dealing with the negative proposals (a special kind of extreme head class) for each object category. Various approaches [18,19,41] have been proposed to handle this issue, with carefully designed strategies to balance the loss or gradient between not only the head classes and tailed classes, but also the positive and negative proposals (including background proposals) for each category. Among these methods, EQL [18] introduces a simple but effective loss, which alleviates the effect of the overwhelming discouraging gradients on tail categories. The EQL version of Mask R-CNN achieves remarkable performance on the challenging LVIS benchmark.

Overview of pipeline
Our DIM is a plug-and-play module for the Mask R-CNN framework [10], which plays its role in an auxiliary branch during training and will be removed during inference. DIM can be directly applied to existing state-of-the-art long-tailed detectors (e.g. EQL Mask R-CNN [18]) without adaptation.
Simplicity is central to the design of our DIM branch. We keep the core designs of Mask R-CNN unchanged. As shown in figure 1, the upper branch is the original instance branch of Mask R-CNN, which simultaneously estimates box locations/categories and instance masks. The upper branch uniformly samples the instance proposals, with long-tailed distribution, generated by the RPN. The lower branch is our DIM branch, which maintains a dynamic instance-level representation for each category, and enables biased learning of the classifier towards tailed classes. It shares the instance classifier with the original instance branch, but when training the classifier, it detaches the instance features sampled from DIM to avoid affecting the representation learning of Mask R-CNN.

Dynamic update with instance-level metric learning
In our design, we expect the instance-level representation in our memory bank to serve as the prototype feature for each object category. For this goal, to form the instance-level representation, we need to absorb representations from different instances of the same category. On the other hand, we should dynamically update our instance representations during training, as the features from the backbone network evolve with the training process. Building upon these considerations, we design a simple metric loss to pull the instance representations in DIM closer to the instances of the same categories in a training mini-batch, and to push them away from the instances of different categories.
Formally, our DIM is a 4D tensor with shape C × D × H × W, where C is the number of object categories (e.g. 80 for MS-COCO), D is the channel number of the RoI feature of Mask R-CNN (D = 256), and H and W are the spatial sizes of the RoI feature of the box branch (H = W = 7). Let N denote the number of foreground proposals sampled for the Mask R-CNN head. For a given object category i, let P_i denote the collection of object proposals belonging to category i, and let N_i denote the collection of proposals of categories other than i; with a slight abuse of notation, we also write P_i and N_i for the sizes of these two collections. Let x_i denote the instance representation in our memory bank for category i, and let x′_i and x′_j denote the RoI features of object proposals of categories i and j in the mini-batch during training. Based on these notations, we reach the following metric loss of our DIM branch for dynamic updating:

L_metric = (1/P_i) Σ_{x′_i ∈ P_i} ‖x_i − x′_i‖² − α (1/N_i) Σ_{x′_j ∈ N_i} ‖x_i − x′_j‖²,  (1)

where α is a hyper-parameter, which we set to 0.01 in our experiments. Based on the above metric loss, the instance memory can be dynamically updated by the proposals sampled for training the Mask R-CNN head. It naturally absorbs information from different instances of the same category during the training process.
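The pull/push structure of the metric loss can be sketched in plain Python. This is an illustrative sketch only: it assumes a squared-Euclidean distance on flattened RoI features and the hypothetical helper names `sq_dist` and `dim_metric_loss`, which are not part of the paper.

```python
ALPHA = 0.01  # weight of the push term, the value used in the experiments

def sq_dist(a, b):
    """Squared Euclidean distance between two flattened RoI feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dim_metric_loss(prototype, pos_feats, neg_feats, alpha=ALPHA):
    """Metric loss for one category prototype: the pull term draws the
    prototype toward same-category RoI features in the mini-batch, and the
    (negated, alpha-weighted) push term drives it away from other categories."""
    pull = sum(sq_dist(prototype, f) for f in pos_feats) / len(pos_feats)
    push = sum(sq_dist(prototype, f) for f in neg_feats) / len(neg_feats)
    return pull - alpha * push
```

With alpha = 0 the loss degenerates to the pull term alone, which matches the ablation observation that α = 0 yields weaker representations.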

Biased learning with frequency reversed sampler
Our motivation is to construct an instance-level memory bank that enables biased classifier learning towards tailed classes, complementary to the existing instance branch that learns from the long-tailed distribution. To achieve this goal, we adopt a class frequency reversed sampler to sample instance features from our memory bank, and use them to train the classifier of the existing instance branch.
For the reversed sampler, the sampling probability of each class is proportional to the reciprocal of its sample size over the whole training set (not just within the mini-batch during training). In other words, the more samples a category has, the smaller the sampling probability assigned to it. We construct the reversed sampler in a similar manner to [17], but compute over object instances rather than images. Formally, let N_i denote the number of instances of class i, and N_max the maximum instance number among all classes. Then, we construct the reversed sampler in the following steps: (1) calculate the sampling probability p_i for class i according to its number of samples as p_i = w_i / Σ_{j=1}^{C} w_j, where w_i = N_max/N_i; (2) randomly sample a class i according to p_i; (3) pick the instance feature of class i from our DIM. By repeating this reversed sampling process, we obtain a set of instance representations from our memory bank. In our experiments, we sample 128 instance samples for our DIM branch.
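The three steps above can be sketched as follows. The function names are hypothetical; the weighting w_i = N_max/N_i follows the BBN-style reversed sampler [17] computed over instances, as described in the text.

```python
import random

def reversed_probs(counts):
    """counts[i] = number of instances of class i over the whole train set.
    Returns p_i proportional to N_max / N_i, so rarer classes get sampled
    more often."""
    n_max = max(counts)
    w = [n_max / n for n in counts]
    total = sum(w)
    return [wi / total for wi in w]

def sample_classes(counts, k=128, rng=random):
    """Draw k class indices with the reversed probabilities; each drawn index
    would then select that class's instance feature from the memory bank."""
    p = reversed_probs(counts)
    return rng.choices(range(len(counts)), weights=p, k=k)
```

For example, with counts [100, 10, 1], the rarest class receives probability 100/111 while the head class receives only 1/111.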
When training the instance classifier with samples from our DIM, we take into account that in early stages our DIM can hardly obtain representative features for each object category, which may negatively affect the training of the instance classifier. Therefore, we adopt a progressive learning strategy for our DIM branch, increasing the weight of the classification loss with the iteration number. Let T denote the current iteration number and T_max the maximum iteration number; the classification loss of our DIM branch is

L^dim_cls = (T/T_max) · L_ce,

where L_ce is the cross-entropy loss over the instance features sampled from DIM. Note that we detach the instance features when training Mask R-CNN: we only utilize DIM to introduce a bias to classifier learning, and do not want to affect the representation learning of Mask R-CNN.
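The progressive schedule can be sketched as below. This assumes a linear ramp of the loss weight from 0 to 1 over training, which is one natural reading of "increasing the weight of classification loss with the iteration number"; the paper does not spell out the exact schedule, and the function names are hypothetical.

```python
def progressive_weight(t, t_max):
    """Linearly ramp the DIM classification-loss weight from 0 to 1, so the
    classifier is barely affected while the memory bank is still immature."""
    return t / t_max

def dim_cls_loss(ce_loss, t, t_max):
    """Weighted classification loss of the DIM branch at iteration t."""
    return progressive_weight(t, t_max) * ce_loss
```

At the start of training the DIM classification term contributes nothing; by the final iteration it is applied at full strength.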

Training and inference
Our DIM branch is an auxiliary branch used only in the training phase. It is a plug-and-play module, which does not make any changes to the other components of Mask R-CNN. Let L_mask_rcnn denote all loss terms of the original Mask R-CNN; the training loss of our pipeline can then be written as

L = L_mask_rcnn + β · L_metric + λ · L^dim_cls,  (3)

where L_metric is the metric loss of our DIM branch, L^dim_cls is its classification loss, and β and λ are hyper-parameters, set to 1.0 and 0.5 respectively in our experiments. During inference, our DIM branch is dropped, so it adds no computational burden.
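Composing the total training objective is then a straightforward weighted sum; the sketch below assumes the three loss terms have already been computed as scalars (the function name `total_loss` is illustrative, not from the paper).

```python
BETA, LAM = 1.0, 0.5  # the loss weights reported in the experiments

def total_loss(l_mask_rcnn, l_metric, l_dim_cls, beta=BETA, lam=LAM):
    """Combine the original Mask R-CNN losses with the two DIM terms:
    the metric loss (weighted by beta) and the DIM classification loss
    (weighted by lam)."""
    return l_mask_rcnn + beta * l_metric + lam * l_dim_cls
```

Since both DIM terms live on the auxiliary branch, dropping them at inference recovers the unmodified Mask R-CNN forward pass.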

Experimental setup

Dataset
All of our experiments are conducted on LVIS [14], a new benchmark for long-tailed object detection and instance segmentation. LVIS provides precise bounding box and mask annotations for various instances with a long-tailed category distribution. LVIS has released two versions of the dataset, v0.5 and v1.0. LVIS v0.5 contains 1230 categories, with 57 000 train images and 5000 val images. LVIS v1.0 consists of 1203 categories in total, for which the train set contains about 100 000 images with 1.3M instances, and the val set has 19 800 images. We train models on the train set, and evaluate performance on the val set. The categories are divided into three groups: rare (1-10 images), common (11-100 images) and frequent (>100 images). We conduct ablation studies and compare to baseline methods on LVIS v0.5, and compare to state-of-the-art results on LVIS v1.0.

Metric
For instance segmentation, besides the common mean average precision (mAP) metric across intersection-over-union thresholds from 0.5 to 0.95, LVIS also evaluates the average precision on rare categories with 1-10 images (AP_r), on common categories with 11-100 images (AP_c), and on frequent categories with >100 images (AP_f). Additionally, we report the average precision of detection boxes (AP_b). It is worth noting that categories in LVIS are not exhaustively annotated, since it is a sparsely annotated dataset. Detected boxes that do not belong to the categories annotated in an image are ignored during evaluation.

Implementation details
We utilize the Detectron2 toolbox [42] to implement our method and all baseline methods. We consider Mask R-CNN [10] and Cascade R-CNN [25] as our baselines, and use ResNet-50 and ResNet-101 [43] with FPN [44] as our backbone networks. Following convention, we adopt horizontal flipping and scale jitter as data augmentation. By default, we use a random/uniform sampler for the existing instance branch of Mask R-CNN, and the frequency reversed sampler for our DIM branch. We also evaluate the repeat factor sampler (RFS) [14], which over-samples images (not instances) containing tailed categories and is effective in improving the overall AP. We set the total batch size to 32 for training. All models are trained with the 2× training schedule. Standard stochastic gradient descent with momentum 0.9 and weight decay 0.0001 is chosen to optimize our model. We train our models for 25 epochs in total with an initial learning rate of 0.04, which is decayed to 0.004 and 0.0004 at the 16th and 22nd epochs, respectively. We set a 1:1 ratio between foreground and background with 256 anchors for the RPN stage, and a 1:3 foreground-background ratio with 512 proposals for the second stage. We choose the top 300 bounding boxes as prediction results.
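The step learning-rate schedule above can be written as a small helper, useful for reproducing the setup outside a config file. The function name is illustrative; the epoch boundaries and rates are the ones stated in the text.

```python
def learning_rate(epoch):
    """Step schedule: 0.04 initially, decayed 10x at epoch 16 and again at
    epoch 22, over 25 training epochs in total."""
    if epoch < 16:
        return 0.04
    if epoch < 22:
        return 0.004
    return 0.0004
```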

Performance comparisons
We first compare our method to baselines with cross-entropy (CE) loss on LVIS v0.5, and then compare with the state-of-the-art results on LVIS v1.0.
As shown in table 1, our DIM method achieves consistent improvements over Mask R-CNN [10] and Cascade R-CNN [25], using ResNet-50-FPN and ResNet-101-FPN as backbone networks, respectively. In particular, we observe that when using ResNet-50-FPN as the backbone, our DIM significantly improves AP_r for the rare categories, by 6.7% (2.5% vs. 9.2%) and 3.7% (8.7% vs. 12.4%) on Mask R-CNN and Cascade R-CNN respectively, without hurting (and even increasing) the performance on frequent categories. These results verify the motivation of our DIM: it boosts the classifier learning of rare categories without hurting the representation learning of all categories (in particular, frequent categories). Additionally, our DIM can be combined with the repeat factor sampler (RFS) to further improve performance.
Apart from the CE loss baseline on LVIS v0.5, we further compare our DIM with recent state-of-the-art methods on LVIS v1.0, i.e. CE, Equalization Loss v2 [45], BAGS [46] and Seesaw Loss [47], in table 2. We observe that our method is compatible with the state-of-the-art methods and brings further improvements. Specifically, our DIM outperforms BAGS with the random sampler and the RFS [14] sampler by 0.9% and 0.5% mAP, respectively. We attribute this to the fact that our DIM is an auxiliary branch used only during training, so equipping previous methods with DIM does not affect their original framework structure. In addition, because we use the reversed sampler to dynamically sample category features from DIM, the classifier can learn each category in a balanced manner, which improves performance. Based on the above analysis, we believe that our DIM can bring further gains when integrated with stronger backbones or frameworks.
As shown in table 3, we have conducted experiments on PASCAL-VOC to compare our method with Faster R-CNN and RetinaNet. We observe that our DIM achieves consistent improvements over Faster R-CNN and RetinaNet on PASCAL-VOC. However, the improvement on PASCAL-VOC is not as significant as on LVIS. We conjecture that this is because the category distribution of PASCAL-VOC is relatively balanced.

Ablation studies
We conduct ablation experiments to verify the prediction score enhancement for rare categories by DIM, and to verify the effectiveness of proposed reversed sampler and progressive learning strategy for our DIM branch. We conduct these experiments on LVIS v0.5.

Prediction probabilities enhancement by DIM
It is often the case that even when the detection head classifies a proposal as a tailed category, that category still receives a very low score. Thus, positive proposals from tailed classes are prone to be missed due to their low prediction scores. In figure 2, we show the detection probabilities of different categories on the LVIS v0.5 val set, averaged over all object proposals generated by the RPN. We adopt Mask R-CNN with a ResNet-50-FPN backbone for these experiments. It can be observed that our DIM significantly improves the detection scores of rare categories, without decreasing the detection probabilities of head categories.

Reversed sampler for DIM
We use the frequency reversed sampler to train the instance classifier on our DIM branch. This design allows our DIM branch to perform biased learning towards tailed classes, without hurting the representation learning driven by the existing instance branch. As shown in table 4, our frequency reversed sampler is crucial for the detection performance of tailed classes, and improves AP r for rare categories from 2.5% to 8.2%.

Progressive learning for DIM
We perform progressive learning for DIM, taking into account the prior that in the early stage of training our memory bank can hardly obtain discriminative representations for all categories, while in the later stage our DIM has been adequately updated by the metric loss. As shown in table 4, our progressive learning strategy improves the performance on all metrics, including rare categories.

Effects of different α values
To explore the effect of α in equation (1), we set α to {0, 0.01, 0.02, 0.03, 0.04, 0.05}. The results are shown in figure 3: we obtain the optimal result (24.3% mAP) when α = 0.01, which shows that the metric loss can dynamically update the instance representations in DIM. The performance decreases when α = 0, because only pulling features of the same category together cannot yield discriminative representations. We also notice that a large α may damage performance; we conjecture that a high value of α makes feature learning more difficult.

Effects of different β values
To explore the effect of β in equation (3), we vary β over {0.5, 1.0, 1.5, 2.0, 2.5}, as shown in figure 4. We obtain the optimal performance (24.3% mAP) when β = 1.0; either increasing or decreasing β damages the mAP. A low β cannot produce a suitable instance memory, while a high β may increase the difficulty of representation learning.

Effects of different λ values
To explore how λ affects performance, we conduct experiments on the LVIS v0.5 dataset with λ set to {0.1, 0.3, 0.5, 0.7, 0.9} in turn. The experimental results are shown in figure 5. We obtain the optimal result (24.3% mAP) when λ = 0.5, which shows that the reverse-sampled instances can effectively alleviate the long-tailed problem. Higher or lower λ values degrade the result; one possible reason is that an over- or under-weighted DIM classification loss harms the learning of the classifier.

Qualitative analysis
The comparisons to existing methods and detailed ablation studies have quantitatively evaluated the effectiveness of our method. In this section, we visualize the learned memory bank, classifiers and the qualitative results to qualitatively analyze our method.

Visualization of memory bank
To explore the semantic topology of the learned memory bank, we adopt t-distributed stochastic neighbor embedding (t-SNE) [48] to visualize the memory bank learned by our proposed DIM; results are shown in figure 6. It is clear that the memory bank maintains a meaningful semantic topology. Specifically, the learned features of the memory bank exhibit cluster patterns. For example, features ('refrigerator', 'stove', 'oven', 'microwave oven' and 'dishwasher') within one super-concept ('kitchenware') tend to be close in the feature space. This indicates that the memory bank may not be limited to learning independent features for each category, but may enjoy generalization capacities.

Visualization of classifier weights
We denote W = {w_1, w_2, . . ., w_C} ∈ R^{D×C} as a set of classifiers for C categories, where w_i ∈ R^D is the weight vector for class i. Previous work [17] has shown that the L2-norm of a classifier's weight vector indicates the learning tendency of the model: the larger the L2-norm, the more the model tends to learn that category. Following [17], we visualize the L2-norms of the weight vectors learned by our DIM and by a vanilla classifier with CE loss. As shown in figure 7, we randomly sample ten classes from each of the 'frequent', 'common' and 'rare' frequency sets for visualization; indexes 1-10, 11-20 and 21-30 represent the 'frequent', 'common' and 'rare' categories, respectively. The L2-norms of the class classifiers for our DIM are roughly equal. For the classifiers trained with the baseline (CE) method, the distribution of L2-norms is impacted by the long-tailed distribution; in particular, for rare categories the L2-norm of the baseline method is much lower than that of our DIM. This visualization further confirms that our DIM introduces a bias toward tailed classes into classifier learning.
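The per-class statistic plotted in figure 7 is simply the L2-norm of each weight vector. A minimal sketch (the function name is hypothetical; weight vectors are taken as plain lists of floats):

```python
import math

def classifier_l2_norms(weights):
    """weights[i] is the weight vector w_i of class i; returns the per-class
    L2-norms used to compare the model's learning tendency across classes."""
    return [math.sqrt(sum(v * v for v in w)) for w in weights]
```

A balanced classifier, as DIM encourages, would yield roughly equal norms across the frequent, common and rare groups, while a long-tail-biased classifier shows much smaller norms for rare classes.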

Qualitative results
We give visual examples in figure 8. We note that the instances in LVIS are not exhaustively annotated, and detected boxes beyond the annotated labels are ignored during evaluation. We observe that one of the remote controls (from the rare categories) is missed by Mask R-CNN, while our method correctly detects and segments it even under occlusion. This also suggests that detecting instances from rare categories under occlusion is an extremely challenging task which deserves more effort in the future.

Conclusions and future works
In this work, we tackled the challenging long-tailed object detection and instance segmentation task by adding an auxiliary branch to the Mask R-CNN framework during training. The proposed DIM is a plug-and-play module, which introduces a bias toward tailed classes to the instance classifier, and has no negative effect on the representation learning driven by the existing instance branch. We verified the effectiveness of our DIM with comprehensive experiments on LVIS v0.5 and v1.0. In the future, we will explore the design of more effective memory banks.

Data availability statement
The data cannot be made publicly available upon publication because no suitable repository exists for hosting data in this field of study. The data that support the findings of this study are available upon reasonable request from the authors. https://cocodataset.org/#home.