Guidance Cleaning Network for Sketch-Based 3D Shape Retrieval

Sketch-based querying of 3D shapes has drawn growing attention from the academic community due to its flexibility and accessibility. However, most recent works focus on reducing the cross-modality discrepancy between 2D sketches and 3D shapes while neglecting the noise introduced by low-quality free-hand sketches. To address this issue, we propose a novel network that decreases this impact in two ways: 1) an attention weighting module that detects noisy samples via a self-attention mechanism; 2) a data cleaning module that clears up low-quality sketches according to a ranking regularization. Experiments on two widely-used datasets demonstrate the effectiveness of our method.


Introduction
With the rapid growth of the computing power of microchips, 3D shapes are used more and more widely across industries. How to retrieve 3D shapes quickly and accurately has therefore become a hot topic in the computer vision community. Compared with using texts or 3D shapes as the query, sketch-based 3D shape retrieval is more intuitive, clear and convenient thanks to the popularity of touch-based devices. However, the cross-modality discrepancy between 2D images and 3D shapes, as well as the large variations among sketches, make it a very challenging task. In recent years, traditional methods based on hand-crafted features have gradually been surpassed by methods using deep neural networks. Specifically, most deep learning approaches follow the pipeline of MVCNN [2] to project 3D shapes into several 2D views. Then, the features of sketches and views are extracted by two CNNs respectively, and finally the features are mapped into a common space to achieve feature alignment. Nevertheless, those methods focus on reducing the cross-modality discrepancy while neglecting the noisy information in sketches. As Fig.1 shows, since painting skills vary among individuals, hand-drawn sketches exhibit relatively larger intra-class variations and smaller inter-class variations than 3D shapes. Such low-quality sketches carry incorrect semantic information, and over-fitting to them harms retrieval precision.
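The view-based pipeline described above can be sketched as follows. This is a minimal, framework-agnostic illustration of MVCNN-style view pooling only; the function name, array shapes and toy values are our own assumptions, not code from the paper.

```python
import numpy as np

def view_pool(view_features):
    """MVCNN-style view pooling: aggregate per-view CNN features of one
    3D shape by element-wise max into a single shape descriptor."""
    # view_features: array of shape (num_views, feat_dim)
    return view_features.max(axis=0)

# Toy example: 3 rendered views of one shape, 4-dimensional features per view.
views = np.array([[0.2, 0.9, 0.1, 0.4],
                  [0.8, 0.3, 0.5, 0.4],
                  [0.1, 0.2, 0.7, 0.6]])
descriptor = view_pool(views)  # element-wise max over the 3 views
```

The pooled descriptor and a sketch feature can then be compared in the common space after alignment.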
Lately, deep learning methods have been applied to detect noisy data. Specifically, a deep convolutional neural network is used to learn a weighting coefficient representing the level of importance, which increases from low-quality samples to high-quality samples. The learning of noisy data is then suppressed to prevent the impact caused by low-quality images.
To address the problems mentioned above, we propose a novel framework named Guidance Cleaning Network, as shown in Fig.2. For the cross-modality discrepancy, we follow the pipeline proposed in CGN [7], which utilizes knowledge distillation to transfer features of sketches to the pre-learned feature space of 3D shapes. For the problem of high variations among sketches, we apply an attention mechanism that learns the weight of each sketch to measure its importance to the loss function. Then, the ranking regularization in the Self-Cure network [13] is adopted to split each mini-batch into a high-importance group and a low-importance group. Finally, the data cleaning module is adopted to clear up low-quality sketches. In order to combine the two aspects above tightly, we introduce a novel loss function mainly composed of a regression loss term and a classification loss term. Specifically, the regression term constrains the transfer learning process, which transfers the features of sketches into the feature space of 3D shapes, while the classification term accelerates transfer learning and assists the attention module in learning the importance weights.
Overall, the contributions of our paper can be summarized as follows: 1. We propose a Guidance Cleaning Network for sketch-based 3D shape retrieval, in which a data cleaning module clears up sketches according to importance weights gained by a self-attention mechanism, reducing the impact of low-quality data. 2. We introduce a novel loss function that combines the two aspects that influence retrieval precision most, namely the large cross-modality discrepancy and the high variations among sketches.
3. The experimental results of our method exceed state-of-the-art approaches on two widely-used datasets.

Related work
In this section, we review some related works of sketch-based 3D shape retrieval and noise detection.

Sketch-based 3D shape retrieval
The main contribution of existing methods is to extract discriminative and modality-invariant features, and they can be roughly classified into traditional methods and deep learning methods. In traditional methods, features are extracted manually [1][4]. However, the discriminative ability of hand-crafted features is limited, and they have gradually been surpassed by deep learning methods. Among deep learning methods, Su et al. [2] proposed MVCNN, which extracts a compact shape descriptor by combining information from multiple 3D shape projections. Building on MVCNN, Lei et al. [5] proposed the "point-to-subspace" metric learning for feature alignment. The method in [7] adopted a distillation network to reduce cross-modal differences: features extracted by a pre-trained teacher network guide the training of the student network. However, those methods neglect the noisy information in sketches, and over-fitting to it is harmful to the retrieval results.

Noise detection
Noise detection for reliability assessment and risk analysis has been studied for a long time [9]. Some approaches exploit a validation set to reduce the impact of noisy data [11]. Other methods leverage extra neural networks to learn sample weights for the dataset [14][15]. For instance, the Self-Cure network [13] relabels noisy samples according to weights learned by a self-attention mechanism. Inspired by [13], we incorporate self-attention into the student network to decrease the impact of noisy samples.

Motivation
Due to the subjectivity of free-hand drawing, there are large intra-class variations and small inter-class variations among sketches. Yet most methods focus on reducing cross-modal differences while ignoring the variations among sketches. In recent years, deep-learning-based noise detection has been applied in the field of facial expression recognition, which inspired us to find and clear up noisy samples with deep neural networks. Different from the relabeling of such samples in [13], the noise in sketches mainly comes from the subjectivity of drawing, so it is more appropriate to simply remove these low-quality samples. Therefore, we use the data cleaning module to clear up those samples, namely the samples that have low importance weights and are incorrectly predicted during training.

Network Architecture
To reduce the cross-modality discrepancy as well as the intra-class variations, we propose a Guidance Cleaning Network, which consists of three modules: i) a knowledge distillation module, ii) an attention weighting module, iii) a data cleaning module.
(i) Knowledge distillation module. Inspired by CGN [7], we use the pre-learned 3D shape feature space to guide the learning of sketch features. Specifically, we follow the MVCNN method to train a classification network and extract the class centers of the 3D shapes. Let c_1, c_2, ..., c_N ∈ ℝ^d denote the class-center features of the N categories of 3D shapes. We assume that 3D shapes and sketches of the same category carry the same semantic information, that is, share the same class centers. Hence, we can use the class centers of the 3D shapes to guide the learning of the sketch network. In the transfer learning stage, the guidance loss in CGN [7] is adopted to transfer the features of sketches effectively, formulated as

L_g = (1/M) Σ_{i=1}^{M} || f_i − c_{y_i} ||²,

where M is the size of the mini-batch, f_i is the extracted feature of the i-th sketch and y_i denotes its corresponding label.
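A minimal numpy sketch of this guidance step, assuming the loss is the mean squared distance between each sketch feature and the class center of its label (the exact form in CGN may differ; all variable names here are ours):

```python
import numpy as np

def guidance_loss(sketch_feats, labels, centers):
    """Mean squared L2 distance between each sketch feature and the
    pre-learned 3D-shape class center of its label.
    sketch_feats: (M, d), labels: (M,) int, centers: (N, d)."""
    diff = sketch_feats - centers[labels]        # select one center per sample
    return float(np.mean(np.sum(diff ** 2, axis=1)))

# Toy example: 2 classes with 2-D centers, a batch of 2 sketches.
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
feats = np.array([[0.0, 1.0], [1.0, 1.0]])      # first sketch is off-center
labels = np.array([0, 1])
loss = guidance_loss(feats, labels, centers)    # (1 + 0) / 2 = 0.5
```

Minimizing this loss pulls sketch features toward the fixed 3D-shape class centers, which is the transfer behavior described above.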
(ii) Attention weighting module. In the sketch network, we introduce an attention weighting module to detect noisy samples, where poorly-painted sketches are expected to get lower weights than high-quality ones. Specifically, the attention module consists of several fully-connected layers followed by a sigmoid function. Let α_i denote the weight of the i-th sketch. To constrain the optimization of the weights, we choose the Logit-Weighted Cross-Entropy loss of the Self-Cure Network [13],

L_WCE = −(1/M) Σ_{i=1}^{M} log( e^{α_i W_{y_i} f_i} / Σ_{j=1}^{N} e^{α_i W_j f_i} ),

where W_j is the j-th weight vector of the classifier.

(iii) Data cleaning module. To find low-importance sketches, the attention weights are ranked in descending order. Then, each mini-batch is divided into a high-importance group and a low-importance group with a ratio r. Formally, we introduce the rank regularization loss in [13] to make sure that the two groups are separated by a margin m:

L_RR = max(0, m − (α_H − α_L)),

where α_H and α_L denote the mean attention weights of the high- and low-importance groups, respectively. A sample in the low-weight group is removed only if its predicted label is inconsistent with the original one and the corresponding probability exceeds that of the given label by a threshold t:

Remove = 1 if P_pred − P_gt > t, and 0 otherwise,

where 1 means the sample will be deleted, P_pred denotes the probability of the predicted label, P_gt is the predicted probability of the given label and t is the threshold. To sum up, the loss function of the student network can be formulated as

L = L_g + L_WCE + L_RR.

Our loss function aims to draw the class centers of sketches and 3D shapes closer after clearing away noisy samples.

Experiments
In this section, we first introduce the two public datasets. Then, we demonstrate the experimental setup and the performance of our approach.

Datasets
SHREC 2013 [4] is a large-scale benchmark for evaluating sketch-based 3D shape retrieval approaches; SHREC 2014 extends it with more categories and models and is therefore more challenging.
Following CGN [7], the training process is divided into two stages. In the first stage, the teacher network is trained to extract features of 3D shapes. In the transfer learning stage, each mini-batch is split into a high-importance group and a low-importance group with ratio r = 0.7. The margin m separating the two groups is set to 0.15. The learning rate is initialized to 0.001 and decays by 0.9 after every 10 epochs. Training finishes after 100 epochs. The data cleaning module starts working at the i-th epoch (i = 10 in this paper) to remove noisy sketches, and the threshold t is set to 0.2.
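The learning-rate schedule quoted above can be expressed as a small helper; this is a sketch of the stated setup, and the function and constant names are ours:

```python
def learning_rate(epoch, base_lr=1e-3, decay=0.9, step=10):
    """Step-decay schedule from the experimental setup: the learning rate
    starts at 0.001 and is multiplied by 0.9 after every 10 epochs."""
    return base_lr * (decay ** (epoch // step))

# Hyper-parameters quoted in the text (constant names are ours):
RATIO_R, MARGIN_M, THRESHOLD_T = 0.7, 0.15, 0.2
CLEAN_EPOCH, TOTAL_EPOCHS = 10, 100
```

So epochs 0–9 run at 0.001, epochs 10–19 at 0.0009, and so on until epoch 100.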

Evaluation of the Proposed Method
For performance evaluation, the widely adopted metrics are used: precision-recall curve, nearest neighbor (NN), first tier (FT), second tier (ST), E-measure (E), discounted cumulated gain (DCG) and mean average precision (mAP) [1]. To demonstrate the effectiveness of the proposed method, we report our results on the above two datasets in comparison with several state-of-the-art approaches, including Siamese [17], DCML [16], LWBR [12], TCL [10], DCA [3], DPSML [5], DSSH [8] and HEAR [6]. The precision-recall curves on SHREC 2013 and SHREC 2014 are shown in Fig.3 and Fig.4, respectively. The precision rate of our method exceeds that of the compared methods as the recall rate changes from 0 to 1. As Table 1 demonstrates, our method also achieves the best performance on SHREC 2013 and SHREC 2014 under the standard evaluation metrics. For instance, the proposed method improves the mAP of DCA, DPSML, DSSH and HEAR by 3.98%, 2.71%, 3.60% and 1.58% on SHREC 2014 with ResNet-50 as the backbone. Since SHREC 2014 is more challenging than SHREC 2013, these results further prove the effectiveness of our method.

Fig.3. The PR curves on SHREC 2013. Fig.4. The PR curves on SHREC 2014.
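Among these metrics, mAP can be sketched as follows. This is the standard formulation of (mean) average precision over ranked retrieval lists, not code from the paper:

```python
import numpy as np

def average_precision(relevant):
    """AP for one query: `relevant` is the ranked retrieval list as 0/1
    relevance flags; AP averages precision at each relevant position."""
    relevant = np.asarray(relevant, dtype=float)
    if relevant.sum() == 0:
        return 0.0
    cum_hits = np.cumsum(relevant)               # hits up to each rank
    ranks = np.arange(1, len(relevant) + 1)
    precisions = cum_hits / ranks                # precision at each rank
    return float((precisions * relevant).sum() / relevant.sum())

def mean_average_precision(all_relevant):
    """mAP: mean of per-query average precisions."""
    return float(np.mean([average_precision(r) for r in all_relevant]))
```

For a query whose relevant shapes appear at ranks 1 and 3, AP = (1/1 + 2/3) / 2 = 5/6; mAP averages this over all sketch queries.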

Conclusion
This paper presents a novel Guidance Cleaning Network for sketch-based 3D shape retrieval, which consists of three modules: a knowledge distillation module, an attention weighting module and a data cleaning module. The training process is divided into two stages, namely the teacher network training stage and the transfer learning stage. The experimental results demonstrate the superiority of our method.