Spatial-Semantic Transformer for Spatial Relation Recognition

Spatial relation recognition, which aims to predict the spatial relation predicate between two objects, has attracted increasing attention in computer vision research. In tackling this problem, modeling the spatial relations between subjects and objects is of great importance. We find that using only spatial features leads to poor results in predicting spatial relations. To overcome this challenge, we propose an effective spatial attention module that enhances spatial features using semantic features. Having identified the importance of the spatial attention mechanism, we further propose a spatial transformer module with encoder layers that builds on this mechanism to recognize unseen spatial relations. Extensive experiments on the benchmark dataset (SpatialSense) show that, by using refined spatial features, our spatial transformer model and spatial attention model achieve state-of-the-art overall accuracy.


Introduction
In recent years, Spatial Relation Recognition (SRR) has been a hot topic in the computer vision community. Recognizing spatial relations is crucial to a richer semantic understanding of the visual world and benefits a wide range of tasks, e.g., object detection [1] [2] and human-object interaction recognition [3] [4]. In spatial relation recognition, spatial relation predicates such as "on" and "to the left of" are recognized, as seen in Figure 1, and the task is formulated as classifying the spatial relation predicate given two objects in an image. Spatial relation recognition in images is a fundamental yet challenging task because the semantics of spatial relations are rich and complex. The spatial semantics between objects depend not only on geometric properties such as location and direction, but also on common sense knowledge (e.g., "the fish on the water" does not accord with common sense, while "the fish in the water" does). By acquiring spatial semantics, robots gain the spatial cognition needed to operate in real-world environments and complete more complex tasks.
In order to enhance the performance of spatial relation recognition, some recent state-of-the-art works rely on Convolutional Neural Network (CNN) architectures, where a model is trained on the coordinates of the bounding boxes of the objects (visual features), and potentially their categories (e.g., chairs, guitars), called language features, to predict the spatial relations. Besides visual and linguistic knowledge, spatial features are also important cues for reasoning over spatial relations in images. To obtain spatial relations among pairwise objects, spatial modules built on bounding box coordinates have been proposed to improve performance [5] [6] [7]. However, these methods only exploit the position information of bounding boxes to describe spatial relations. Instead, we take into consideration both the simple position information of bounding boxes and their more complex overlap or intersection relations when modeling spatial relations. We propose a spatial relation extraction module based on the bounding box information that extracts spatial features considering the size, location, and area of the bounding boxes in the image. Moreover, we adopt a two-channel binary image representation to characterize the spatial relations. Recently, attention mechanisms have shown promising performance in relation recognition since they can refine features and lead to further improvements [6], [8], [9]. However, these methods do not attempt to use binary maps and convolution operations to extract attention features. Further, they do not leverage attention weights that encode spatial-semantic relations to refine the spatial feature. Specifically, we propose a spatial attention module that encourages the network to focus on discriminative regions by leveraging the semantic-spatial knowledge of subject-object pairs. We utilize a spatial transformer module with encoder layers to achieve superior performance in spatial relation recognition. Spatial attention is used within the spatial transformer to exploit informative spatial cues. Moreover, motivated by [10], we define a spatial relation graph for modeling spatial relations. The nodes in this graph are the subjects and objects, while the attention weights generated from the spatially refined visual feature serve as the importance of the edges in the graph.
In summary, our contributions are threefold. (1) We propose a spatial attention module that utilizes a binary image representation to extract complex bounding box information and leverages the semantic knowledge of subject-object pairs, so that spatially relevant subject-object pairs are amplified. (2) We build a spatial transformer module with encoder layers to aggregate the spatial-semantic features, help with knowledge transfer among object classes, and exploit the spatial relation graph for modeling spatial relations. (3) We show through extensive experiments that the spatial attention module and the spatial transformer module effectively boost model performance and achieve competitive results on a benchmark dataset.

Related Work
Taking spatial relations into account in image analysis has been a hot topic in the computer vision community. Early researchers used fuzzy model-based approaches to represent spatial constraints among objects; fuzzy sets are crucial for accounting for the imprecision inherent to images [11], [12]. After that, spatial descriptor-based approaches that build on fuzzy models were proposed to model spatial relations. Among the spatial relation recognition methods based on spatial descriptors, the angle histogram [13] and force histogram [14] are the most basic descriptors, and many other spatial recognition methods build on them. The angle histogram treats the objects as sets of points and calculates the angles between the line connecting two points (one in each object region) and the horizontal line. Instead of studying pairs of points, the force histogram processes the objects into longitudinal sections. The ratio histogram [15] and spread histogram [16] are based on the angle histogram, while the ϕ descriptor [17], multilevel polygonal descriptor [18], and force banner [19] all rely on the force histogram to describe spatial relations.

Spatial-Semantic Transformer for SRR
In this section, we introduce the key components of the model, including the spatial attention and spatial transformer modules, the spatial relation graph and graph convolutional inference module, and the spatial feature extraction module. The overall framework of the proposed spatial relation recognition method is shown in Figure 2. The model consists of five main branches. Binary Image Representation characterizes the overlap or intersection relations among objects by constructing two binary maps based on bounding boxes, from which the spatial relation feature is extracted using a CNN. The Spatial Attention Module refines spatial features by leveraging spatial-semantic knowledge and an attention mechanism: Xors is fed into a self-attention network and projected into query, key, and value embeddings. The Spatial Transformer Module is employed to gain global-context information and model the pairwise spatial relations; its core component is the self-attention module defined above. In the Graph Convolutional Network (GCN), a spatial relation graph is utilized to aggregate the spatial and semantic features and generate effective features for subjects and objects. Spatial Feature Extraction builds three kinds of spatial features to refine the final semantic-spatial features with spatial information represented by the bounding boxes.

Binary Image Representation
Binary image representation is widely utilized to model spatial relations [4], [10], [27]-[29]. It characterizes the overlap or intersection relations among objects. Given the bounding boxes, two binary maps are generated to model the spatial relations. Specifically, the union of the two bounding boxes is taken as the reference box and scaled to a fixed size. After that, a binary image with two channels is constructed. The first channel has value 1 within the subject bounding box and value 0 elsewhere; the second channel has value 1 within the object bounding box and value 0 elsewhere. We then extract spatial relation features using two layers of convolutions and max pooling. Figure 3 describes the process of spatial feature generation using binary image representation. First, subject-object proposals are extracted to model the spatial relation of the object pair. Second, we focus on the union of the two bounding boxes, called the attention window. Third, the union is scaled to a fixed size, removing context outside the focused union. Fourth, two binary maps are generated to capture the interaction pattern between subject and object. Finally, convolution layers and a pooling layer are utilized to extract the final spatial feature.
The spatial feature generation based on binary image representation can be written as Sos = Flat(CNN(Bos)), where Sos is the spatial feature, Bos is the two-channel binary spatial relation map between object o and subject s, CNN denotes the two convolution and max-pooling layers described above, and Flat is a flattening operation that reshapes the spatial feature into a two-dimensional feature.
Figure 3. Process of spatial feature generation using binary image representation.
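For concreteness, the following is a minimal PyTorch-style sketch of the binary image representation and the spatial feature branch described above. The fixed map size (32) and the convolution channel widths are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def binary_spatial_map(subj_box, obj_box, size=32):
    """Build the two-channel binary map Bos from two bounding boxes.
    Boxes are (tlx, tly, brx, bry) in image coordinates; the fixed map
    size (32x32) is an assumption for illustration."""
    # reference box: union of the two bounding boxes
    ux1 = min(subj_box[0], obj_box[0]); uy1 = min(subj_box[1], obj_box[1])
    ux2 = max(subj_box[2], obj_box[2]); uy2 = max(subj_box[3], obj_box[3])
    w, h = max(ux2 - ux1, 1e-6), max(uy2 - uy1, 1e-6)
    b = torch.zeros(2, size, size)
    for c, (x1, y1, x2, y2) in enumerate([subj_box, obj_box]):
        # rescale each box into the fixed-size union window
        c1 = int((x1 - ux1) / w * size); r1 = int((y1 - uy1) / h * size)
        c2 = int((x2 - ux1) / w * size); r2 = int((y2 - uy1) / h * size)
        b[c, r1:max(r2, r1 + 1), c1:max(c2, c1 + 1)] = 1.0
    return b

class SpatialFeature(nn.Module):
    """Sos = Flat(CNN(Bos)): two conv + max-pool layers, then flatten."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(2, 32, kernel_size=5, padding=2)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=5, padding=2)

    def forward(self, b_os):              # b_os: (batch, 2, size, size)
        x = F.max_pool2d(F.relu(self.conv1(b_os)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        return x.flatten(start_dim=1)     # flattened spatial feature Sos
```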

Spatial Attention Module
We find that using only spatial features leads to poor results in predicting the spatial relation. To address this issue, an effective spatial attention module is designed to enhance spatial features using semantic features.
We first construct the feature Xors by concatenating the spatial feature and the semantic features, Xors = [Ls; Sos; Lo], where Ls is the semantic feature of the subject and Lo is the semantic feature of the object. Xors is then fed into a self-attention network. Three projection matrices Wq, Wk, and Wv project the feature into the query, key, and value embedding spaces: query = Xors·Wq, key = Xors·Wk, and value = Xors·Wv.
After that, we compute the attention weights using a scaled dot-product normalized by Softmax, weights = Softmax(query · keyT / sqrt(dk)), where keyT is the transpose of key, dk is the dimension of the key and query, and the dot denotes the matrix product; the Softmax operation normalizes the scaled dot-products into probability distributions. We then augment the spatial feature as matmul(weights, value), where matmul denotes matrix multiplication. In this way, the spatial feature is augmented with the word embedding of each object's category through self-attention. The spatial attention module enables knowledge transfer among object spatial semantics and helps with spatial relation recognition during both training and inference.
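The following is a minimal sketch of this spatial attention computation, assuming a PyTorch implementation; the module name, hidden size (1024, following the experimental details), and dropout placement are assumptions rather than the authors' exact code.

```python
import math
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of the spatial attention module: project Xors into query,
    key, value, apply scaled dot-product attention, and return the
    augmented spatial feature."""
    def __init__(self, in_dim, dk=1024, dropout=0.5):
        super().__init__()
        self.q_proj = nn.Linear(in_dim, dk)   # Wq
        self.k_proj = nn.Linear(in_dim, dk)   # Wk
        self.v_proj = nn.Linear(in_dim, dk)   # Wv
        self.dk = dk
        self.dropout = nn.Dropout(dropout)

    def forward(self, x_ors):
        # x_ors: concatenation [Ls; Sos; Lo], shape (batch, seq_len, in_dim)
        query = self.q_proj(x_ors)
        key = self.k_proj(x_ors)
        value = self.v_proj(x_ors)
        # scaled dot-product attention, normalized with Softmax
        weights = torch.softmax(query @ key.transpose(-2, -1) / math.sqrt(self.dk), dim=-1)
        weights = self.dropout(weights)
        # augment the spatial feature with semantic information
        return weights @ value
```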

Spatial Transformer
We employ a standard transformer encoder structure [30] to gain global-context information and model pairwise spatial relations. The transformer encoder layers map all input features into a new representation that holds the global information for the final semantic-spatial feature.
A transformer encoder layer is a stack of multi-headed self-attention and a feed-forward network composed of two linear transformation layers. Its core component is the self-attention module defined in the spatial attention module above. In each transformer encoder layer, the multi-head self-attention computes attention weights for the input and produces an output vector with encoded information. The multi-headed attention output is fed into the feed-forward network to produce the output of the layer.
We first concatenate the semantic feature of the subject Ls, the spatial feature Sos, and the semantic feature of the object Lo as Xors. We then feed Xors into the global-context encoder, which leverages the spatial relations of subject-object pairs and refines the language features such that spatially relevant subject-object pairs are amplified.
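As a rough illustration, the spatial transformer branch can be assembled from standard PyTorch encoder layers. Treating the three features as a three-token sequence, as well as the model width and head count below, are assumptions; the paper only states that Xors is fed into stacked encoder layers (it experiments with 1, 2, 4, and 6 layers).

```python
import torch
import torch.nn as nn

class SpatialTransformer(nn.Module):
    """Sketch of the spatial transformer branch: stacked standard
    transformer encoder layers over the [Ls; Sos; Lo] sequence."""
    def __init__(self, d_model=768, nhead=8, num_layers=1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, ls, sos, lo):
        # treat subject semantics, spatial feature, and object semantics as
        # a short token sequence and let self-attention share global context
        x_ors = torch.stack([ls, sos, lo], dim=1)   # (batch, 3, d_model)
        return self.encoder(x_ors)                  # refined semantic-spatial features
```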

Spatial Relation Graph and Graph Convolutional Inference Module
This module utilizes the spatial relation graph to aggregate the spatial and semantic features and generate effective features for subjects and objects through a graph convolutional network.
In the spatial relation graph, we define the subjects and objects as nodes and use the spatial attention feature as the edge adjacency. This allows the edges to exploit the spatial relation between subject-object pairs and generate better features.
After that, a Graph Convolutional Network (GCN) is applied to perform reasoning on the spatial relation graph. Following [10], we traverse and update the nodes in the spatial relation graph along their edges. Given the subject feature fs, the object feature fo, and the edges connecting the subject and object, the updated graph features fs′ and fo′ are computed by aggregating the neighboring node's feature weighted by the adjacency, where aos and aso define the adjacency between s and o.
The final semantic-spatial feature fsg is encoded by a linear weighted combination of the pair of graph features fs′ and fo′, using two weight matrices Wg1 and Wg2 followed by a batch normalization operation Norm.
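Since the original update equations are not recoverable from this copy, the sketch below shows one plausible instantiation of the graph convolutional inference and fusion under the stated assumptions (adjacency-weighted neighbor aggregation, then a two-matrix fusion with batch normalization); it is not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class SpatialRelationGCN(nn.Module):
    """One plausible instantiation of the graph convolutional inference:
    each node aggregates its neighbor weighted by the adjacency derived
    from the spatial attention feature, then the pair is fused into fsg.
    The exact update rule is an assumption, not the paper's equations."""
    def __init__(self, dim):
        super().__init__()
        self.w_node = nn.Linear(dim, dim)       # shared node transform
        self.w_g1 = nn.Linear(2 * dim, dim)     # Wg1
        self.w_g2 = nn.Linear(dim, dim)         # Wg2
        self.norm = nn.BatchNorm1d(dim)         # Norm

    def forward(self, fs, fo, a_os, a_so):
        # a_os, a_so: adjacency weights between subject s and object o
        fs_p = torch.relu(fs + a_os * self.w_node(fo))   # fs'
        fo_p = torch.relu(fo + a_so * self.w_node(fs))   # fo'
        fused = self.w_g2(torch.relu(self.w_g1(torch.cat([fs_p, fo_p], dim=-1))))
        return self.norm(fused)                          # fsg
```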

Spatial Feature Extraction Module
The spatial feature extraction module is designed to refine the final semantic-spatial features with spatial information represented by the bounding boxes. Although RVL-BERT [9] achieves good results for recognizing spatial relations, it pays no attention to the size information of the bounding boxes. Our module captures not only the location information but also the size information of a pair of objects. The bounding box of each object is treated as a 4D vector (tlx, tly, brx, bry), where tlx and tly are the horizontal and vertical coordinates of the top-left corner of the bounding box in the image, and brx and bry are the coordinates of the bottom-right corner.
Then, we build three kinds of spatial features to represent the spatial information of the bounding box. The first spatial feature takes width and height into account: following [3], [7], [8], [34], a 4D vector (tlx, tly, w, h) is formed as the spatial feature, where w and h denote the width and height of the bounding box. The second spatial feature integrates position as well as width and height information, using an 8D vector (tlx, tly, brx, bry, xcenter, ycenter, w, h) [2]. Our third spatial feature additionally considers the area of the bounding box and is a 5D vector [25] [35] [6] [36], where W and H are the width and height of the image, and A and Aimg are the areas of the object and the image, respectively.
The spatial features of the subject and object are fed into the spatial module and are each encoded using a two-layer, fully connected network.
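A minimal sketch of the three hand-crafted spatial features and their encoding follows. The exact components of the 5D area-based feature are an assumption (the text above only states that it uses the image size and the box/image areas), and the fully connected dimensions are illustrative.

```python
import torch
import torch.nn as nn

def spatial_features(box, img_w, img_h):
    """Build the three spatial feature vectors for one bounding box
    (tlx, tly, brx, bry). The 5D normalized feature layout is an
    assumption made for illustration."""
    tlx, tly, brx, bry = box
    w, h = brx - tlx, bry - tly
    xc, yc = tlx + w / 2.0, tly + h / 2.0
    area, area_img = w * h, img_w * img_h
    feat_4d = torch.tensor([tlx, tly, w, h])                        # position + size
    feat_8d = torch.tensor([tlx, tly, brx, bry, xc, yc, w, h])      # corners + center + size
    feat_5d = torch.tensor([tlx / img_w, tly / img_h,               # assumed: normalized corners
                            brx / img_w, bry / img_h,               # plus relative area A / Aimg
                            area / area_img])
    return feat_4d, feat_8d, feat_5d

# Each spatial feature is then encoded by a two-layer fully connected head
# (dimensions are illustrative assumptions):
encoder = nn.Sequential(nn.Linear(5, 256), nn.ReLU(), nn.Linear(256, 768))
```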

Experiments

Dataset
SpatialSense [32] is a recently collected spatial relation recognition benchmark containing 17,498 spatial relations in 11,569 images. All images are collected from Flickr and NYU Depth [33].
The annotated spatial relations in the dataset contain 3679 unique object classes and 9 unique predicates.

Experimental Details
We refine the spatial feature with the word embedding of the objects' labels, which is a 300-dimensional vector. In the attention mechanism, the query, key, and value projections map the refined spatial feature into 1024-dimensional vectors. The spatial attention module then uses two attention operations to extract the attention weights that augment the spatial features. Moreover, to enhance the generalization capability of the model, we use two dropout operations with probability 0.5 in the spatial attention module. During training, validation, and testing, the final semantic-spatial features have different shapes when the spatial relation graph is utilized; based on this shape difference, we design three kinds of features to encode the final semantic-spatial feature. The spatial features of the subject and object are fed into the spatial module and embedded into 768 dimensions using linear layers. All models are trained on Linux with a single NVIDIA RTX 2080 Ti GPU. We trained our model for 45 epochs on the SpatialSense dataset.
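For reference, the hyperparameters stated above are collected in one place below; values not given in the paper are omitted rather than guessed.

```python
# Hyperparameters taken from the experimental details above.
CONFIG = {
    "word_embedding_dim": 300,   # object label word embedding
    "attention_dim": 1024,       # query/key/value projection size
    "num_attention_ops": 2,      # attention operations in the spatial attention module
    "dropout": 0.5,              # two dropout operations with this probability
    "spatial_embed_dim": 768,    # subject/object spatial feature embedding size
    "epochs": 45,                # training epochs on SpatialSense
}
```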

Comparison of Results
We compare our model with a variety of recent methods on the SpatialSense dataset. ViP-CNN [31] presents a Phrase-guided Message Passing Structure (PMPS) to establish connections among relation components. Peyre et al. [5] develop a powerful visual descriptor for representing object relations and a weakly-supervised model for learning them. PPR-FCN [6] uses the role of context to recognize object interaction patterns. DRNet [28] integrates a variety of cues for visual relation detection: appearance, spatial configurations, and the statistical relations between objects and relation predicates. VTransE [7] is an end-to-end, fully convolutional architecture that offers comprehensive scene understanding for connecting computer vision and natural language. Language-only, 2D-only, and Language + 2D are baselines in [32] that use simple language and/or spatial features. Language + 2D + Depth and Language + 2D + Depth + Disp. are baselines in [22] exploring depth information in spatial relation recognition. Ding et al. [22] also propose the DSRR model, building a three-stream module to exploit language, 2D location plus depth, and relative displacement cues. RVL-BERT [9] exploits visual commonsense knowledge in addition to learned linguistic knowledge to boost visual relation detection. We propose the Spatial-Attention and Spatial-Transformer models based on RVL-BERT [9], using spatial cues to improve spatial relation recognition.
As we can see from Table I, our method maintains better performance and shows clear advantages. Our Spatial-Attention and Spatial-Transformer achieve 72.9% and 73.3% accuracy, respectively, which is better than the other methods. Compared with RVL-BERT [9], our Spatial-Transformer model improves the accuracy by 3.8% in recognizing the spatial predicate "above", 0.4% in "behind", 3.0% in "in", 1.6% in "next to", 0.4% in "on", 2.2% in "to the left of", and 2.8% in "under".

Analysis of Spatial Transformer Encoder Layers
In order to validate the effect of the transformer encoder layers, we experiment with different numbers of encoder layers (1, 2, 4, and 6). SE Model denotes the spatial feature extraction model, and Basic denotes the spatial module in [9]. SE1 stands for modeling spatial relations with the 5D area-based spatial feature, SE2 represents spatial relation extraction using the 8D vector (tlx, tly, brx, bry, xcenter, ycenter, w, h), and SE3 means that the 4D vector (tlx, tly, w, h) is formed as the spatial feature to predict the spatial relation. The last row of Table II shows that a single encoder layer performs slightly better than stacking additional layers across all settings: the fewer the transformer encoder layers, the better the performance. Moreover, the spatial transformer with encoder layers improves recognition accuracy over the baseline, since it incorporates the bounding box information and the semantic knowledge of the object categories to capture spatial relations.

Analysis of Spatial Relation Graph Construction
The spatial relation graph aggregates the spatial and semantic features. To validate the effect of the spatial relation graph construction, we design three kinds of spatial relation graphs and examine their recognition performance. Language and Spatial denote that the nodes of the spatial relation graph are language features and spatial features, respectively, while Language-Spatial stands for the concatenation of language and spatial features. SG Model means the spatial relation graph model. Table III shows the influence of different spatial relation graph constructions on model performance. The spatial relation graph whose nodes are spatial features further boosts the model by 0.7% (accuracy of "above"), while the spatial relation graph based on Language-Spatial features lifts the Basic model by 2.4% (accuracy of "above"). Moreover, the spatial relation graph relying on spatial features increases the accuracy of "next to" by 1.3%. A spatial relation graph whose nodes are enhanced spatial features more easily captures spatial predicates.

Analysis of Spatial Feature Extraction
The spatial feature extraction module is utilized to emphasize the bounding box information and capture distinctive spatial information. Table IV shows the influence of different spatial feature extraction schemes on model performance. SE1 and SE2 give better performance and show clear advantages in recognizing "above", "behind", and "in", while SE3 performs slightly better than the Basic model in recognizing "in". The spatial feature extraction module is able to capture spatial configurations of subjects and objects such as "above", "behind", and "in", which are important for spatial relation recognition.

Spatial Attention, Spatial Transformer and Spatial Relation Graph
Spatial Attention, Spatial Transformer, and Spatial Relation Graph all utilize the binary image representation to express the spatial relation. We experiment with different combinations to compare their performance. SE represents spatial feature extraction and SA stands for spatial attention. ST(1) means the spatial transformer with 1 encoder layer, while SG (Spatial) represents the spatial relation graph whose nodes are spatial features. As we can see from Table V, ST(1) is better than the other methods under basic conditions and outperforms the baseline by a relative 1.0% in overall accuracy. We also find that the spatial attention module provides a relative 0.6% boost in accuracy. Models with attention features are more effective, since the attention features refine the spatial features by amplifying the pairs with high spatial-semantic correlation.

Conclusion
In this paper, we propose a spatial attention module that enhances spatial features by leveraging the semantic-spatial knowledge of subject-object pairs. Having identified the importance of the attention mechanism, we propose a spatial transformer module with encoder layers to recognize unseen spatial relations. Moreover, we investigate various ways, such as the spatial feature extraction module and spatial relation graph construction, to derive the spatial representation for spatial relation recognition. We conducted extensive experiments on SpatialSense, which demonstrate that our spatial transformer model and spatial attention model achieve state-of-the-art performance.

Figure 1. Task of spatial relation recognition in images.

[1] Xu H, Jiang C, Liang X, and Li Z. 2019. Spatial-aware graph relation network for large-scale object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 9298-9307.
[2] Hu R, Xu H, Rohrbach M, Feng J, Saenko K, and Darrell T. 2016. Natural language object retrieval. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 4555-4564.
[3] Fang H S, Cao J, Tai Y W, and Lu C. 2018. Pairwise body part attention for recognizing human-object interactions. Proceedings of the European Conference on Computer Vision (ECCV), pp 51-67.
[4] Hou Z, Yu B, Qiao Y, Peng X, and Tao D. 2021. Affordance transfer learning for human-object interaction detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 495-504.
[5] Peyre J, Sivic J, Laptev I, and Schmid C. 2017. Weakly-supervised learning of visual relations. IEEE International Conference on Computer Vision, pp 5179-5188.
[6] Zhuang B, Liu L, Shen C, and Reid I. 2017. Towards context-aware interaction recognition for visual relationship detection. IEEE International Conference on Computer Vision, pp 589-598.
[7] Zhang H, Kyaw Z, Chang S F, and Chua T S. 2017. Visual translation embedding network for visual relation detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 5532-5540.
[8] Han C, Shen F, Liu L, Yang Y, and Shen H T. 2018. Visual spatial attention network for relationship detection. Proceedings of the ACM International Conference on Multimedia, pp 510-518.

Table 1 .
Performance comparison with existing models on the SpatialSense dataset. Results of the existing methods are extracted from [22] and the respective papers.

Table 2 .
The influence of the number of transformer encoder layers on model performance.

Table 3 .
The influence of different spatial relation graph construction on model performance.

Table 4 .
The influence of different spatial feature extraction on model performance.

Table 5 .
The comparison of spatial feature extraction, spatial attention and spatial transformer.