Adopting multiple vision transformer layers for fine-grained image representation

Accurate discriminative region proposal has an important effect on fine-grained image recognition. The vision transformer (ViT) has brought striking improvements to computer vision due to its innate multi-head self-attention mechanism. However, its attention maps become gradually similar after certain layers, and since ViT relies on a classification token to perform classification, it cannot effectively select discriminative image patches for fine-grained classification. To accurately detect discriminative regions, we propose a novel network, AMTrans, which efficiently increases the number of layers to learn diverse features and integrates raw attention maps to capture more salient features. Specifically, we employ DeepViT as the backbone to solve the attention-collapse issue. We then fuse the per-head attention weights within each layer to produce an attention weight map. After that, we alternately apply recurrent residual refinement blocks to promote salient feature detection, and use a semantic grouping method to propose the discriminative feature region. Extensive experiments show that AMTrans achieves SOTA performance under the same settings on three widely used fine-grained datasets: Stanford-Cars, Stanford-Dogs and CUB-200-2011.


Introduction
Detecting discriminative regions is critical for fine-grained image recognition, a challenging task because the key features are subtle yet vital. With the progress of neural network methods, the performance of fine-grained image recognition has improved greatly [1]. Currently, weakly supervised methods that require only image-level labels are the popular approach. There are two types of network backbones: CNN-based and ViT-based. ViT-based networks are easier to train, lower in complexity and more accurate in capturing subtle discriminative features, which makes ViT more valuable in practice.
CNN-based models fall into two categories: localization and feature-encoding approaches. The former are more interpretable and easier to understand; they usually train a discriminative region proposal network and reuse the proposed regions for classification. RA-CNN [2] proposed a recurrent attention CNN that recurrently learns attention maps at three scales. MA-CNN [3] employed a channel grouping approach to generate multiple consistent feature vectors through end-to-end training; however, the number of attention regions is a hyper-parameter, which limits the productivity and flexibility of the network. Liu et al. [4] proposed the filtration and distillation learning network (FDL), which adopts knowledge distillation to recurrently detect critical regions. Zheng et al. [5] proposed TASN, which uses a trilinear attention sampling network and a feature distiller module to strengthen discriminative regions. Unfortunately, these two networks are difficult to train and extend, and have high complexity. Feature-encoding methods instead rely on deep feature representations to achieve better FGVC performance. Yu et al. [6] proposed the HBP method for cross-layer bilinear pooling, verifying that low-level features can compensate for the object structure information missing from high-level semantics. Zheng et al. [7] proposed the general DTB block, which combines channel grouping with group bilinear pooling; because DTB keeps the feature dimensions of input and output consistent, a CNN can integrate it into any layer as needed. However, as network depth increases, these networks become heavy, and it is difficult to explain how they obtain the subtle salient regions.
Recently, some studies have innovatively introduced the transformer into computer vision tasks, opening a new era for CV. Transformer-based visual models have developed rapidly since 2019, with many achievements worth noting. Dosovitskiy et al. [8] presented the vision transformer (ViT), the first to use a transformer to solve computer vision tasks; however, it only employs a classification token to predict categories, which is inappropriate for fine-grained representation. He et al. [9] presented TransFG, which developed a region selection method to propose discriminative regions but cannot generate multi-scale fine-grained features. To resolve this problem, Zhang et al. [10] presented AFTrans, which adaptively selects the most sensitive patches to optimize region proposal. Wang et al. [11] proposed FFVT, which fuses the most significant tokens from each encoder layer as the inputs to the last layer. However, all of the above studies are limited in the depth of their transformer encoder layers, so they only fuse narrow features.
To solve these problems, we propose a novel model, AMTrans, which employs re-attention in place of the multi-head self-attention mechanism to increase the depth of the transformer encoder, and then uses feature fusion to enhance the salient feature map. Specifically, we use DeepViT [12] to increase the number of layers. We fuse all the attention weights within every transformer encoder layer, and then integrate the shallow-level and deep-level features as the input of recurrent residual refinement blocks (RRBs). The salient feature map output by the RRBs is fed to a channel attention module that proposes the most important region of the input image. Finally, the proposed region is fed back into our model to perform classification. AMTrans outperforms existing networks pre-trained on ImageNet, as shown in Figure 1. The contributions of this research are as follows:
- To capture more diverse features, we increase the depth of the encoder and fuse the attention weights within each transformer encoder layer.
- To enhance discriminative region proposal, we use recurrent residual refinement blocks to improve salient feature detection, and then use a semantic grouping method to select the best discriminative region on the input image.
- To the best of our knowledge, this is the first work to successfully increase the encoder depth in order to fuse more attention weights.

Method
This section introduces our model AMTrans, which consists of three parts: attention weight fusion, salient feature detection and discriminative region proposal.
An overview of AMTrans is shown in Figure 2. The backbone of our model is DeepViT [12]; AMTrans fuses the attention weights within each transformer encoder layer and then uses R³Net [13] to reinforce salient features for critical region proposal. In Figure 2, the image is divided into blocks of the same size, which are the input of DeepViT. We fuse the multi-head attention weights of all layers head by head using the Hadamard product, and generate the attention weight map by concatenating the results. Subsequently, we employ R³Net to enhance salient features and then propose a discriminative region. Finally, we crop and enlarge the selected region of the input image, which becomes the input of DeepViT again. As Figure 2 shows, the attention weight map has dimensions $A \in \mathbb{R}^{B \times L \times K \times P \times P}$ (B = batch size, L = number of transformer layers, K = number of heads, P = number of patches). We then split this map into shallow-level features (L) (i.e., from layer 1 to layer 16) and deep-level features (H) (i.e., from layer 17 to layer 32) as the input of R³Net, which generates the saliency feature map (the input and output dimensions of R³Net are unchanged). Finally, we use a CNN to propose the critical region.
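To make the data flow concrete, the following is a minimal sketch of the two-pass pipeline described above. All module names (deepvit, fuse_attention, r3net, propose_region) are hypothetical placeholders rather than the released implementation, and splitting the raw maps into shallow and deep groups before fusion reflects one plausible reading of the text.

```python
def amtrans_forward(image, deepvit, fuse_attention, r3net, propose_region):
    # Pass 1: DeepViT returns logits and raw attention weights per layer/head.
    logits, attn = deepvit(image)              # attn: (B, L, K, P, P)
    shallow = fuse_attention(attn[:, :16])     # fuse shallow layers (Eqs. 2-4)
    deep = fuse_attention(attn[:, 16:])        # fuse deep layers
    saliency = r3net(shallow, deep)            # recurrent residual refinement
    crop = propose_region(image, saliency)     # SG + largest connected region
    # Pass 2: classify the cropped, enlarged region.
    logits_crop, _ = deepvit(crop)
    return logits, logits_crop
```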

Fusion attention weight
To avoid the attention collapse issue, we adopt a simple yet effective method, DeepViT [12], which stacks more transformer encoder layers to increase the diversity of attention weights at negligible cost. DeepViT replaces the self-attention mechanism with re-attention, whose output is:

$$\mathrm{Re\text{-}Attention}(Q, K, V) = \mathrm{Norm}\!\left(\Theta^{\top}\,\mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)\right)V \quad (1)$$

where Norm(·) is a normalization function, Softmax(·) is the softmax function, $\Theta \in \mathbb{R}^{K \times K}$ is a learnable transformation matrix (K is the number of heads), Q, K and V are the Query, Key and Value tensors, and d is the dimension of the Key tensor.
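For illustration, a minimal PyTorch sketch of a re-attention layer in the spirit of Eq. (1) is given below. This is our re-implementation under stated assumptions (the initialization of Θ and the use of BatchNorm as Norm(·)), not the authors' code.

```python
import torch
import torch.nn as nn

class ReAttention(nn.Module):
    """Sketch of DeepViT-style re-attention (Eq. 1)."""
    def __init__(self, dim, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        # Learnable K x K matrix mixing the attention maps across heads.
        # Near-identity initialization is our assumption.
        self.theta = nn.Parameter(torch.eye(num_heads)
                                  + 0.01 * torch.randn(num_heads, num_heads))
        self.norm = nn.BatchNorm2d(num_heads)   # Norm(.) in Eq. (1)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)            # each: (B, K, N, d)
        attn = (q @ k.transpose(-2, -1)) * self.scale   # (B, K, N, N)
        attn = attn.softmax(dim=-1)
        # Re-attention: mix the K per-head maps with the learnable theta.
        attn = torch.einsum('hk,bknm->bhnm', self.theta, attn)
        attn = self.norm(attn)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        # Returning attn as well allows the weight-fusion step of Section 2.
        return self.proj(out), attn
```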
In this paper, we use the pre-trained DeepViT-32B [12] network, which is composed of 32 layers. Since each head attends to a different region of the image within each layer, we adopt the element-wise (Hadamard) product to fuse the attention weights of every encoder layer, grouped by the K heads, to reinforce effective attention features. The attention weight map of the m-th head in the l-th layer is

$$A_m^{l} = \mathrm{Softmax}\!\left(\frac{Q_m^{l}\,(K_m^{l})^{\top}}{\sqrt{d}}\right) \in \mathbb{R}^{B \times P \times P} \quad (2)$$

where P is the number of patches. We then take the Hadamard product of each head's maps across all layers, so the final attention weight map of the m-th head is

$$A_m = \prod_{l=1}^{N} A_m^{l} \quad (3)$$

where ∏ denotes the Hadamard product, N is the number of layers, and $A_m \in \mathbb{R}^{B \times P \times P}$ (B is the batch size).
Thus the final fused attention weight map is

$$A = \mathrm{concat}(A_1, \ldots, A_K) \quad (4)$$

where concat is the concatenation operation, K is the number of heads, and $A \in \mathbb{R}^{B \times K \times P \times P}$.
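A minimal sketch of Eqs. (2)-(4) is given below; the (B, N_layers, K_heads, P, P) layout in which the raw maps are stacked is our assumption.

```python
import torch

def fuse_attention_weights(attn):
    """Hadamard product of each head's maps across layers (Eq. 3),
    then concatenation across heads (Eq. 4)."""
    fused_per_head = attn.prod(dim=1)   # element-wise product over the layer axis
    # Keeping the head axis of the (B, K, P, P) tensor is equivalent to
    # concatenating the K per-head maps in Eq. (4).
    return fused_per_head

# Hypothetical shapes: batch 2, 32 layers, 12 heads, 196 patches.
A = fuse_attention_weights(torch.rand(2, 32, 12, 196, 196))  # -> (2, 12, 196, 196)
```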

Salient feature detection
Detecting subtle features is the soul of fine-grained image representation, but it is a difficult task.
To address it, we adopt R³Net [13] to enhance salient features. An overview of R³Net is shown in Figure 3: the original saliency map $S_0$ is H, which is repeatedly refined by residual refinement blocks (RRBs). As Figure 3 also shows, R³Net uses the integrated shallow-level features to capture more saliency details, which compensates for the weakness that deep-level features rely only on rich semantics. The RRBs can accurately propose salient feature regions on the input image. An RRB is defined as

$$S_k = S_{k-1} + F\big(\varphi(M, S_{k-1})\big) \quad (5)$$

where F is a CNN, φ is the concatenation operation, $S_{k-1}$ is the saliency map predicted at step k-1, and the feature map M is alternately set to the integrated shallow-level or deep-level features. In this work, we use three RRBs.
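The following is a minimal sketch of one RRB under Eq. (5); the channel sizes and the depth of the residual CNN F are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ResidualRefinementBlock(nn.Module):
    """Sketch of Eq. (5): the refined map is the previous map plus a
    residual predicted by a small CNN from concat(M, S_{k-1})."""
    def __init__(self, feat_channels, mid_channels=64):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Conv2d(feat_channels + 1, mid_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, 1, 3, padding=1),
        )

    def forward(self, M, S_prev):
        x = torch.cat([M, S_prev], dim=1)   # phi: channel-wise concatenation
        return S_prev + self.residual(x)    # S_k = S_{k-1} + F(phi(M, S_{k-1}))

# One plausible alternation over three RRBs, with S0 taken from the deep
# features (H) as in Figure 3:
#   S1 = rrb1(L, S0); S2 = rrb2(H, S1); S3 = rrb3(L, S2)
```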

Integrating
Since DeepViT has 32 encoder layers, we set the shallow-level layers to {1-15} and the deep-level layers to {16-32}. Capturing the critical yet subtle region is the core of the fine-grained task, so we use the output of R³Net ($S_n$) as the input feature for selecting the discriminative region. The process of discriminative region proposal is shown in Figure 4, and a sketch of the level split follows.
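Below is a minimal sketch of the shallow/deep split using the layer ranges given in this subsection. The per-layer, head-fused map layout and the mean as the integration operator are our assumptions; the text says "integrated" without naming the operator.

```python
def integrate_levels(attn_maps):
    """Split per-layer attention maps into integrated shallow (L) and
    deep (H) features for R3Net. attn_maps: (B, 32, P, P)."""
    shallow = attn_maps[:, 0:15].mean(dim=1, keepdim=True)   # layers 1-15
    deep = attn_maps[:, 15:32].mean(dim=1, keepdim=True)     # layers 16-32
    return shallow, deep
```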

Discriminative region proposal
As shown in Figure 4, we apply semantic grouping (SG) to obtain the relative weights of the regions, and then take the Hadamard product with $S_n$ to reinforce what should be attended to. Finally, we use a largest-connected-region selection method to pick the best discriminative region. SG consists of channel grouping and intra-group strengthening. The output of SG is

$$W = \sigma\big(\mathcal{I}(\mathcal{C}(S_n))\big) \quad (6)$$

where $\mathcal{C}$ is the channel grouping method (fastcluster [14]), $\mathcal{I}$ is the matrix product within each group, σ is the sigmoid function, and W denotes the output of SG. Hence, the refined feature is

$$T = S_n \otimes W \quad (7)$$

where ⊗ is the Hadamard product and T has the same dimensions as $S_n$.
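A rough sketch of SG under Eq. (6) is given below. The exact intra-group "matrix product" is underspecified in the text, so the Gram-matrix reduction here is our assumption, as is the number of groups.

```python
import numpy as np
import fastcluster
from scipy.cluster.hierarchy import fcluster

def semantic_grouping(feat, num_groups=4):
    """Cluster channels with fastcluster, score each channel by an
    intra-group affinity, and squash with a sigmoid (Eq. 6).
    feat: (C, H, W) saliency features for one image."""
    C = feat.shape[0]
    flat = feat.reshape(C, -1)                       # (C, H*W) channel descriptors
    Z = fastcluster.linkage(flat, method='ward')     # channel grouping
    groups = fcluster(Z, t=num_groups, criterion='maxclust')
    W = np.zeros(C)
    for g in range(1, num_groups + 1):
        idx = np.where(groups == g)[0]
        gram = flat[idx] @ flat[idx].T               # intra-group matrix product
        W[idx] = gram.mean(axis=1)                   # per-channel affinity score
    return 1.0 / (1.0 + np.exp(-W))                  # sigmoid -> channel weights W
    # Refined map (Eq. 7): T = S_n * W[:, None, None]
```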
Finally, we use the largest-connected-region selection method to extract the best discriminative region from T and crop that region from the input image. The cropped region is then enlarged and fed to DeepViT.
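The following is a minimal sketch of this final step. The 0.5 relative threshold is our assumption, and the saliency map is assumed to be already resized to the image resolution.

```python
import numpy as np
from scipy import ndimage

def largest_connected_crop(image, saliency, threshold=0.5):
    """Binarize the refined map T, keep the largest connected component,
    and crop that region from the input image."""
    mask = (saliency >= threshold * saliency.max()).astype(np.uint8)
    labels, num = ndimage.label(mask)                 # connected components
    if num == 0:
        return image                                  # fallback: whole image
    sizes = ndimage.sum(mask, labels, index=range(1, num + 1))
    largest = labels == (int(np.argmax(sizes)) + 1)
    ys, xs = np.where(largest)
    # Crop the bounding box; the caller enlarges it back to 224x224.
    return image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
```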
Next, we set the experimental parameters. For fair comparison, the image size is 224×224 and the patch size is 16×16. The batch size takes the common value 256. We use a DeepViT-32B network pre-trained on ImageNet1k [18] and an R³Net pre-trained on MSRA10K. We employ the SGD optimizer with a fixed learning rate of 0.0002. AMTrans is trained on two Tesla V100 GPUs with PyTorch as our codebase.
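For reference, a minimal sketch of this optimization setup is shown below; the Linear layer is a placeholder, since the AMTrans model itself is not reproduced here.

```python
import torch
import torch.nn as nn

model = nn.Linear(768, 200)                                 # stand-in for AMTrans
optimizer = torch.optim.SGD(model.parameters(), lr=2e-4)    # fixed LR 0.0002
batch_size, image_size, patch_size = 256, 224, 16           # values from the text
```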

Performance comparison
To evaluate the performance of our network, we compare AMTrans with current SOTA approaches on three benchmarks, using accuracy as the evaluation metric. From Table 1, AMTrans achieves good results and surpasses both CNN-based and ViT-based methods.
From the 4th column of Table 1, AMTrans achieves a 2.1% improvement over S3N [19] and surpasses all CNN-based methods. We argue that all methods obtain better results on this dataset because its images contain less background noise. In terms of accuracy, our model brings 2.7% gains compared to TransFG [9]; we believe feature fusion and salient region proposal are the main reasons.
From the 5th column of Table 1, vision transformer methods surpass CNN-based models. We attribute this to the hard-to-find inter-class differences between certain objects in Stanford-Dogs [16], which demonstrates the advantage of the vision transformer. AMTrans still achieves the best performance, reaching 92.7%, a 1.7% gain over HAVT [20].

Ablation studies
This section analyzes the influence of each part of AMTrans through ablation studies. We conduct all studies on CUB-200-2011 [15]; the other datasets show the same trends.
Our model adopts DeepViT [12], which significantly promotes feature diversity; the benefit it brings is shown in Table 2.
Table 2. Ablation experiment on different backbones.

From Table 2, we observe that stacking more transformer encoder layers improves performance; we believe the multi-head re-attention mechanism increases discriminative feature diversity. Specifically, DeepViT_32B brings a 1.5% improvement over ViT_32B. However, increasing the encoder depth to 44 layers yields only a further 0.3% accuracy gain while increasing time complexity, so this research employs 32 layers.

Table 3. Ablation experiment on R³Net and discriminative region proposal.

In Table 3, DeepViT_32B is the baseline. R³Net brings 1.1% gains, showing that fusing all levels of features is beneficial and reinforces the information of the region of interest. SG brings a further 1.4% gain; we attribute this to its channel grouping and intra-group strengthening, which focus on discriminative informative features and suppress less useful ones. Hence, our model combines DeepViT with R³Net and SG.

Visualization experiments
We randomly select an image from each dataset and conduct a visualization experiment; the results are shown in Figure 5. To verify the strength of AMTrans, we also conduct a comparative experiment.

Conclusion
This research puts forward a novel model, AMTrans, which achieves SOTA performance on three fine-grained benchmarks: CUB-200-2011, Stanford Dogs and Stanford Cars. To resolve the attention collapse problem, we employ DeepViT, which allows us to increase the depth of the transformer encoder to obtain more diverse features, and we fuse the attention weights within each layer to reinforce the representation. We also use multiple recurrent residual refinement blocks to promote discriminative features and suppress noisy ones, and adopt a semantic grouping method to select the critical region. In the future, we will explore data fusion (e.g., internet data, videos, etc.) to further improve fine-grained image representation.

Figure 1. Accuracy comparison of SOTA networks.

Figure 2. The architecture of AMTrans.

Figure 3. The structure schematic of R³Net.

Figure 4. Overview of discriminative region proposal.

Figure 5. Visualization of discriminative region proposal on three benchmarks. ViT-based methods capture better discriminative parts with subtle critical features than CNN-based methods, and AMTrans obtains the most discriminative region, as shown in the 5th row.