Explainer on GNN-based segmentation networks

Graph Neural Networks (GNNs) are powerful tools for deep learning. Like other neural networks, GNNs are complex models whose decision-making procedures humans cannot understand, which creates a need for GNN explainability. Explainability is critical for deep learning to support its predictions. In this paper, we investigate the Grad-CAM and Integrated Gradients explaining methods. Grad-CAM applies global average pooling to the gradients of the class score with respect to the feature activation maps, followed by a ReLU activation, to obtain an attribution. Integrated Gradients explains models by taking a line integral of the gradients along the path between a baseline image (a black image) and the source image. We demonstrate how the Grad-CAM and Integrated Gradients methods explain a deep graph model on semantic segmentation tasks over the Cityscapes dataset. FCN and LRASPP-MobileNet are used as comparisons to DualGCN in the experiment to show the explaining effect.


Introduction
In the past decades, graph neural networks (GNNs) have significantly advanced the capability of computers to learn from graph-structured data. Typical tasks of GNNs include graph classification, node classification, link prediction, etc. They have been applied in a wide range of domains, such as transportation prediction, molecule classification, quantum chemistry, and recommender systems [1]. Recently, different kinds of GNN architectures have been proposed to utilize graph data, for example graph convolutional networks [2], recurrent graph neural networks, spatial graph convolutional networks, temporal graph convolutional networks, and graph attention networks.
Similar to other deep models, these deep graph models have great learning capabilities to capture the underlying patterns in the data. However, as with other deep models, GNNs are also treated as black boxes, in which we have no clue how and why certain predictions are produced. Therefore, their predictions lack supportive explanations. This drawback is severe in decision-critical domains: without strong evidence and explanations, predictions of GNNs cannot be trusted. Therefore, explaining methods for deep graph models are necessary. Explainability has been a central question for deep neural networks. The current state of the art of explainability in graph neural networks can be divided into two groups: instance-level explanation and model-level explanation [3]. Firstly, instance-level explanations give explanations based on an individual input: an input graph is given to the deep graph model, and explanations are generated according to that input. Secondly, model-level explanations generate input-independent explanations: they provide insight into the general behavior of deep graph models without regard to specific inputs.
In this work, we explain the DualGCN model applied in the image segmentation domain. Specifically, DGCNet applied to the Cityscapes dataset for image segmentation will be explained by Grad-CAM and Integrated Gradients.
The Grad-CAM method extracts semantic meaning from the last convolutional layer. Specifically, it first computes the gradient of the score for a class c with respect to the feature activation maps, and then applies global average pooling over the width and height dimensions to obtain the neuron importance weights. Finally, a ReLU activation is applied to the weighted combination of the neuron importance weights and the feature activation maps.
The Integrated Gradients method computes the contribution of each pixel to the output of a neural network. Precisely, a path from the baseline image, usually a black image, to the actual image is constructed, and an integral of the gradients of the output with respect to the input image is taken along that path to attribute the input features.

Explainability in graph neural networks
The instance-level explanation consists of four types of methods: gradients/features-based methods, perturbation methods, surrogate methods, and decomposition methods.
The gradients/features-based methods generate explanations based on gradients of the function approximated by the deep graph model. Many such methods are used to explain deep graph models, including Sensitivity Analysis [4], CAM [5], and Grad-CAM [5]; the difference lies in the mathematical operations applied to the approximated functions. On the other hand, the perturbation methods study the effect of input perturbations on output variations. Firstly, important input features are represented by different masks depending on the task; secondly, a new graph is produced by combining the generated masks with the input graph; finally, the masks are evaluated and the mask-generating algorithms are updated by passing the new graph into the trained graph neural network. Common perturbation-based methods include GNNExplainer [6], PGExplainer [7], and SubgraphX [8]; they differ in the mask-generating algorithms, the types of masks, and the objective functions.
Surrogate methods aim to use simple and interpretable models to explain deep models on a local dataset around the input data. GraphLime [9], PGM-Explainer [10], and RelEx [11] are common surrogate methods for explaining deep graph models. The main difference between these methods lies in how the local datasets are obtained and which surrogate models are used. Finally, decomposition methods decompose the predictions into several terms, and these terms represent the importance scores of the input features. Currently, Layer-wise Relevance Propagation [12], BP-Excitation [5], and GNN-LRP [1] are employed to explain deep graph models. Intuitively, these methods use score decomposition rules to distribute importance scores layer by layer from the output layer back to the input layer; the importance scores obtained at the input layer are regarded as the explanation.

Graph neural networks
Graph neural networks are a type of artificial neural network for processing graph data [13]. Wu et al. [14] categorized GNNs into four categories: recurrent graph neural networks (RecGNNs), convolutional graph neural networks (ConvGNNs), graph autoencoders (GAEs), and spatial-temporal graph neural networks (STGNNs). RecGNNs learn node representations by employing recurrent neural architectures; they were the pioneers, contributing conceptually to many later graph neural architectures such as ConvGNNs. ConvGNNs generalize the convolution operation from grid data to graph data, for example by using the graph Laplacian, a matrix representation of the graph. Additionally, GAEs are unsupervised models that encode nodes or graphs into a latent vector space. Finally, STGNNs learn the underlying patterns from spatial-temporal graphs, which is essential for domains with spatial and temporal dependencies.
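As a concrete illustration of the ConvGNN idea, a widely used propagation rule multiplies the node features by a symmetrically normalized adjacency matrix (a graph-Laplacian-style normalization) before a learned linear map and nonlinearity. The sketch below is a minimal NumPy version; the function name and toy shapes are our own, not from the paper:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution layer: H' = ReLU(D^-1/2 (A+I) D^-1/2 H W).

    A: (N, N) adjacency matrix of the graph
    H: (N, F_in) node feature matrix
    W: (F_in, F_out) learnable weight matrix
    """
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    d = A_hat.sum(axis=1)                   # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))  # symmetric normalization
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)
```

Stacking several such layers lets each node aggregate information from increasingly distant neighbors.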

Image segmentation
Image segmentation aims to separate an image into several semantically meaningful regions by labeling each pixel according to the object it depicts [15]. Deep neural networks have significantly improved this field. For example, Fully Convolutional Networks [16] removed all fully-connected layers to produce a spatial segmentation map instead of classification scores, but objects in images were not well localized. To tackle this issue, Chen et al. proposed combining CNNs with graphical models, namely Conditional Random Fields (CRFs), to localize segment boundaries [17]. Additionally, Visin et al. proposed ReSeg, an RNN-based model for semantic segmentation [18]. These are only a small subset of the vast field of image segmentation, but they all miss one aspect of the task: the relationships between objects in the images. Deep graph models are well suited to fill this gap, with nodes and edges representing objects and relationships, and many GNN-based models have emerged recently. For example, DGCNet uses a dual-GCN framework to model input graphs [19]. Additionally, DGMN predicts node dependencies by dynamically sampling the neighbors of a node [17]. On the other hand, the model of Li et al. enables learning from the original feature space and fully extracts relationships across different layers through an improved Laplacian formulation [20]. Finally, GNNs are a fast-growing field, and this paper can only discuss limited aspects of them; a more comprehensive review is given by Chen et al. [15].

Overview
In this paper, we explain the DualGCN model using the Grad-CAM, Integrated Gradients, and LRP methods. In section 3.2, the overall architecture of the DualGCN model is discussed, and in sections 3.3-3.5 the details of the Grad-CAM, Integrated Gradients, and LRP methods are introduced.

DualGCN model
The DualGCN model consists of a ResNet for feature extraction and two GCNs, a feature-space GCN and a coordinate-space GCN, for feature-level and spatial understanding of objects in images. The coordinate-space GCN enables coherent predictions by modeling the spatial relationships between pixels in the input images. On the other hand, assuming that later layers in the network capture features and a high-level understanding of the image, the feature-space GCN models the interdependencies of the feature maps in the network. Refer to Zhang et al. for a more detailed discussion of the DualGCN architecture [19]. The Grad-CAM method uses the final convolutional layer as the target layer for an explanation. The final convolutional layer, which contains both semantic and spatial information about object parts, is expected to contain the most useful information, whereas the fully-connected layers lose the spatial relationships among objects. Grad-CAM uses the gradients obtained by back-propagation to assign a contribution score to each neuron for a class of interest.

Grad-Cam
Formally, we first compute the gradients of the score $y^c$ for class $c$ with respect to the feature map activations $A^k$ of the final convolutional layer. Secondly, these gradients are global-average-pooled over the width and height dimensions (indexed by $i$ and $j$) to obtain the neuron importance weights:

$$\alpha_k^c = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A_{ij}^k},$$

where $Z$ is the number of spatial locations in the feature map. This summation is equivalent to successive matrix products of the gradients with respect to the feature activations and the weight matrices up to the target convolutional layer. Therefore, the weight $\alpha_k^c$ represents a partial linearization of the deep network downstream of the feature activations $A$ and encodes the importance of feature map $k$ for the target class $c$. Finally, a ReLU activation is applied to the weighted combination of the weights $\alpha_k^c$ and the activation maps $A^k$:

$$L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\!\left(\sum_k \alpha_k^c A^k\right).$$
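The two steps above can be sketched directly in NumPy. This is a minimal illustration (the function name and array layout are our own), assuming the activations and gradients of the target layer have already been extracted:

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM attribution from a target layer's activations and gradients.

    activations: (K, H, W) feature maps A^k of the final conv layer
    gradients:   (K, H, W) gradients dy^c/dA^k for the class of interest
    """
    # Global average pooling of the gradients gives the neuron
    # importance weights alpha_k^c (one scalar per feature map).
    alpha = gradients.mean(axis=(1, 2))             # shape (K,)
    # Weighted combination of the feature maps, then ReLU keeps only
    # the features with a positive influence on the class score.
    cam = np.tensordot(alpha, activations, axes=1)  # shape (H, W)
    return np.maximum(cam, 0.0)
```

The resulting (H, W) map is typically upsampled to the input resolution for visualization.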

Integrated-gradients
The Integrated Gradients method computes a line integral of gradients along the straight path between the input and a baseline input. Formally, let $F : \mathbb{R}^n \to [0,1]$ be a function representing a deep network, let $x \in \mathbb{R}^n$ be the input image, and let $x' \in \mathbb{R}^n$ be the baseline input (a black image). Consider the straight line between $x'$ and $x$, and compute the gradients of $F$ at all points along it. Integrated gradients are defined as the path integral of these gradients along the straight path from the baseline $x'$ to the input $x$. The integrated gradient along the $i$-th dimension for an input $x$ and baseline $x'$ is defined as

$$\mathrm{IG}_i(x) = (x_i - x'_i) \int_0^1 \frac{\partial F\big(x' + \alpha (x - x')\big)}{\partial x_i} \, d\alpha,$$

where $\partial F(x) / \partial x_i$ is the gradient of $F(x)$ along the $i$-th dimension. Furthermore, the integrated gradients satisfy a completeness property: summed over all dimensions, they equal the difference between the function output at the input $x$ and at the baseline $x'$,

$$\sum_i \mathrm{IG}_i(x) = F(x) - F(x').$$
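In practice the path integral is approximated by a Riemann sum over a fixed number of interpolation steps between the baseline and the input. A minimal sketch (the function name, gradient-callback interface, and step count are our own choices, not from the paper):

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Riemann-sum approximation of Integrated Gradients.

    grad_fn:  callable returning dF/dx at a given input
    x:        input image (any array shape)
    baseline: reference input x' (e.g. a black image of zeros)
    """
    total = np.zeros_like(x)
    for k in range(1, steps + 1):
        # Point on the straight-line path from the baseline to the input.
        point = baseline + (k / steps) * (x - baseline)
        total += grad_fn(point)
    # Average gradient along the path, scaled by the input difference.
    return (x - baseline) * total / steps
```

For a quadratic toy function such as F(x) = sum(x^2), whose gradient is 2x, the approximation converges to IG_i(x) = x_i^2 as the number of steps grows, and the attributions sum to F(x) - F(x') as the completeness property requires.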

Dataset
We conduct experiments on the Cityscapes dataset, an open-source semantic scene understanding dataset containing street scenes from 50 different cities. The models used in the experiment are pre-trained on the Cityscapes dataset.

Settings
The experiments were performed on Google Colab notebooks, using T4 GPUs with the standard 15 GB of RAM. Besides the discussed DualGCN model, a Fully Convolutional Network (FCN) [16] and LRASPP-MobileNet were used in the comparison. FCN is a type of neural network commonly used for semantic segmentation tasks, and LRASPP-MobileNet is a semantic segmentation architecture for mobile devices [21]. All models selected for the experiments are pre-trained. After each model made a segmentation prediction, Grad-CAM and Integrated Gradients were applied to explain the prediction by attributing relevance to each pixel. The methods were set to attribute for the class road, which is class 0.
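To attribute a segmentation model's output for a single class such as road, the per-pixel logits of that class can be summed into a scalar score before back-propagating; the Grad-CAM weights are then read from the target layer's activations and gradients. The PyTorch sketch below is a minimal version of this setup under our own assumptions (the helper name and the hook-based implementation are ours; the paper does not specify its implementation details):

```python
import torch
import torch.nn as nn

def segmentation_grad_cam(model, target_layer, image, class_idx=0):
    """Grad-CAM for a segmentation model: sum the per-pixel logits of
    one class (class 0, road, in our experiments) into a scalar score,
    then weight the target layer's feature maps by the pooled gradients."""
    acts, grads = [], []
    fwd = target_layer.register_forward_hook(lambda m, i, o: acts.append(o))
    bwd = target_layer.register_full_backward_hook(
        lambda m, gi, go: grads.append(go[0]))
    logits = model(image)                  # (N, num_classes, H, W)
    logits[:, class_idx].sum().backward()  # scalar score for the class
    fwd.remove()
    bwd.remove()
    # Neuron importance: global average pooling of the gradients.
    alpha = grads[0].mean(dim=(2, 3), keepdim=True)
    # Weighted combination of the feature maps, then ReLU.
    return torch.relu((alpha * acts[0]).sum(dim=1))
```

With a pre-trained torchvision segmentation model one would pass its final convolutional layer as `target_layer`; any module whose output is a 4-D feature map works the same way.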

Comparison of results
The motivations of the experiment are to explain the predictions of deep models and to show the effect of the explaining methods on different models.

Conclusion
Recent advances in graph neural networks have drawn significant attention from academics and the public. The increasing capability of deep graph models has also increased model complexity, making humans unable to understand the decision-making mechanisms of these deep models, so the explainability of deep graph models becomes necessary. We showed the effect of Grad-CAM and Integrated Gradients applied to DualGCN, FCN, and LRASPP-MobileNet. DualGCN demonstrated clear attribution among pixels, whereas FCN and MobileNet showed obscure segmentations in the experiment.

4.3.1. Grad-CAM experiment
The results of the Grad-CAM method applied to the three models on the Cityscapes dataset are shown in Figure 3. DualGCN demonstrated precise segmentations over the four examples, in which green pixels are attributed by Grad-CAM as relevant to the predictions. On the other hand, FCN attributes most pixels in the images as relevant, which gives an obscure attribution. Similarly, LRASPP-MobileNet depicts object shapes but still marks unclear boundaries in the images. Potential causes of the unclear boundaries are, firstly, that FCN and MobileNet need more epochs of training to obtain precise segmentations, since the predictions of these two models were obscure, resulting in unclear explanations; and secondly, that the implementations of FCN and MobileNet are not as compatible with the Grad-CAM method as DualGCN's, so the back-propagated information was not selecting the appropriate pixels.

Figure 3. Experiment of Grad-CAM on different Cityscapes images. The first row is the source image. The second row is the DualGCN model. The third row is the FCN model. The fourth row is the LRASPP-MobileNet.

Figure 4. Experiment of Integrated Gradients on different Cityscapes images. The first row is the source image. The second row is the DualGCN model. The third row is the FCN model. The fourth row is the LRASPP-MobileNet.