An evaluation method with convolutional white-box models for image attribution explainers

Research on model explanation has developed rapidly in recent years. However, research on the evaluation of explainers is still limited, with most evaluation strategies centered around perturbation. This approach may produce incorrect evaluations when the relationships between features become complicated. In this paper, we present a ground-truth-based evaluation method for feature attribution explainers on image tasks. We design three evaluation perspectives. The input perspective evaluates whether the explainers accurately represent the inputs that the model perceives. The feature perspective evaluates whether the explainers capture the features that are important to the decision. The user perspective evaluates how well the explanation matches the valid information a user derives from the data. Using a traditional white-box model, we extract the ground truth corresponding to the three perspectives and provide an example to demonstrate the procedure. To obtain the results of image attribution explainers, we also reconstruct the traditional white-box model as a convolutional white-box network. Our method provides an a priori benchmark that is not affected by the explainer. The experiments show that the evaluation method applies to different tasks and extends to natural datasets, which offers a flexible and low-cost evaluation strategy.


Introduction
The internal complexity of black-box models makes it challenging to understand how their decisions are made. A model with good performance does not necessarily reason correctly. Many models rely on the distribution of the training data and learn spurious associations that one would not expect. For example, if all input images of dogs show grass, the model will presume that dogs are strongly associated with lawns; when it encounters a dog in water, it may misidentify the animal. Because of the resulting confidence problem, it is difficult to apply these complicated models to security-critical fields such as economics, medicine and law.
Researchers have proposed numerous explanation methods to address this problem. A typical approach is to present the attribution of input features. It provides a post-hoc, local explanation, which means that it generates an explanation of a single input sample for a trained model. We call this technique the attribution explainer. The targets of our evaluation method are image attribution explainers such as Gradient, Class Activation Mapping and Layer-wise Relevance Propagation (LRP).
Image attribution explainers are commonly evaluated by perturbation: for example, remove an input feature that the explanation considers important and verify whether it really matters to the prediction. This approach does not know the true relationships between features; it is merely a validation tool that depends on the explainer's findings. If either of two features can help the model classify, the prediction may be unaffected by removing one of them. Moreover, the removal may shift the data distribution, so the change in the prediction may not be related to the feature alone.
The evaluation of an explainer is concerned with the reasoning process behind the model's prediction [1] and its understandability for stakeholders [2]. Because of this, we propose an explainer-independent evaluation approach with three evaluation perspectives: the input perspective, the feature perspective and the user perspective. The first two evaluate how faithfully the explanation follows the decision-making process. The input perspective focuses on whether the explainer can fully demonstrate the part of the input that is involved in the model's computation. The feature perspective focuses on whether the explainer can capture the features that influence the final decision. The last one evaluates the psychological effect that the explanation has on the user; it focuses on whether the explainer meets the user's intuition.
To construct an explainer-independent method, we need to obtain the ground truth of the model without using explainers. However, the only way to comprehend a black-box model is to rely on an explainer, so we choose to build a white-box model for the evaluation. The ground truth is equivalent to the model's true label for the explanation. Comparing explainer-generated explanations with this ground truth makes it possible to evaluate the explainers.
In this study, we quantify three types of ground truth corresponding to the three perspectives on a traditional white-box model. Then, we reconstruct the traditional white-box model as a convolutional white-box network that is applicable to image attribution explainers. The convolutional model does not require training because its internal structure and parameters are set exactly according to the traditional model. We can consider the two models as different forms of the same algorithm, so the ground truth on the traditional model can also be regarded as the ground truth of the convolutional model. Finally, we design a metric called non-overlap to calculate the distance between the ground truth and the explainer results.
In Section 2, we discuss work related to the evaluation of explanations. In Section 3, we describe our evaluation method and provide a concrete example of quantifying the ground truth for the three evaluation perspectives; the quantitative evaluation metric is also introduced. In Section 4, we apply the evaluation method to three image attribution explainers, namely Occlusion [3], Saliency [4] and LRP [5]. We also set up the corresponding convolutional white-box models and report the evaluation results. The conclusion is offered in Section 5.

Related work
There are different views on the evaluation of explanations. Some believe that an explanation is understandable when seen [6]. Such a perception is embedded in the presentation of explanations and their evaluations [4]: the authors demonstrate outcomes that are in line with human psychological expectations, showing the superiority of the explanation. These are often qualitative evaluations. Others contend that, when evaluating, it is important to distinguish between several viewpoints, such as faithfulness, plausibility and readability, and point out that human judgement should not be involved in the evaluation of faithfulness [1]. In conclusion, the user and the model are the two key factors involved in the evaluation of explanations.
An evaluation approach that makes use of ground truth has also been put forward. It is usually concerned with the model aspect. Ground truth is described as the features of the input that are in fact important to a model [7]; it serves as a benchmark for an explanation, and the evaluation is completed by comparing the ground truth with the result of the explainer. As the absolute importance of features is not available, researchers turn to the relative importance of features, which can be controlled by varying the frequency of certain features in the dataset. In that study, artificial datasets are constructed by combining foreground and background information. Others have also created artificial datasets to modify the relative importance of features and derive the ground truth of the model [8]. Building white-box models, such as generalized additive white-box models [9] and LSTM white-box models [10], is another way to obtain ground truth: the perception of the white-box model can serve as a kind of ground truth. Similarly, four transparent models based on simple linear models and decision trees have been designed for different data types and different types of explanations, combining to form a system for evaluating explanations [11].
We also provide a ground-truth-related evaluation method. Different from other methods, ours does not limit itself to the model but also takes the user into account. In addition, most of the ground truth in the aforementioned methods is described qualitatively, while our method obtains a clearly defined ground truth.

Ground truth quantified by traditional white-box model
According to the three evaluation perspectives, there are three kinds of ground truth: input ground truth, feature ground truth and user ground truth. The first two are related to the model, while the last one is related to the user. To further clarify the ground truth, we provide a traditional white-box model as an example.
We propose a geometric classification task: the model should determine whether the input belongs to the circle, line, triangle or rectangle category. Figure 1 shows sample input images a to f. All of them are 8*8 in size, with each grid cell representing a single pixel. Images a and b should be classified as circles; images c and d as rectangles; image e as a triangle; image f as a line. All inputs are binarized in the experiments. To classify inputs like those in figure 1, the following white-box model is designed.
We obtain the corner points of the input image by using a traditional image operator. Corner points are points with high curvature and contain the geometric information of the image. For each pixel in the input, we calculate the greyscale difference in eight directions (0°, 45°, 90°, 135°, 180°, 225°, 270° and 315°) and then add up the squared results along the same line.
A location is marked as a feature point (corner point) if the change along every direction is above a threshold. This feature-point-finding algorithm is in fact the detection process of the Moravec operator [12], which can effectively extract feature points that are significant for the classification task. We generate the feature maps displayed in figure 2(a) by applying the method to the inputs in figure 1. The area that the algorithm perceives is also the region that the model actually receives from the input, as shown in figure 2(b), excluding the part beyond the original input. We define it as the input ground truth. The black frame represents the part of the original input that has semantic meaning.
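The corner detection step can be sketched as follows. This is a minimal illustration with an assumed threshold value and border handling, not the exact implementation used in the paper:

```python
import numpy as np

# The four lines through a pixel, each defined by two opposite directions
# (0°/180°, 45°/225°, 90°/270°, 135°/315°).
LINES = [((0, 1), (0, -1)), ((-1, 1), (1, -1)),
         ((-1, 0), (1, 0)), ((-1, -1), (1, 1))]

def moravec_like_corners(img, threshold=1.0):
    """Mark pixels whose summed squared greyscale difference exceeds the
    threshold along every one of the four lines (an assumed variant of the
    Moravec operator on a binarized 8*8 image)."""
    h, w = img.shape
    corners = np.zeros_like(img, dtype=bool)
    for r in range(h):
        for c in range(w):
            line_scores = []
            for d1, d2 in LINES:
                score = 0.0
                for dr, dc in (d1, d2):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w:
                        score += (float(img[rr, cc]) - float(img[r, c])) ** 2
                line_scores.append(score)
            corners[r, c] = all(s >= threshold for s in line_scores)
    return corners
```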
The feature map can be decomposed into local features, as shown in figure 3. We constrain the size of the local features so that all of them can be written as matrices of uniform size; with this restriction, we can transform the algorithm into a convolutional white-box model more easily. Combining these features in different ways can recover the original feature map. We simplify these combinations so that each one corresponds uniquely to a single classification. The rules are as follows.
If only a single F10 exists, the input is classified as a circle; if two F8 and two F9 exist, it is also classified as a circle. If F4 exists without F6 and F7, if F5 exists without F6 and F7, if only a single F2 exists, if only a single F3 exists, or if only two F1 exist, the input is classified as a line. If F6 or F7 exists, it is classified as a triangle. If only two F2 exist, only two F3 exist, or only four F1 exist, it is classified as a rectangle. We can achieve classification by following these rules. We also obtain the feature ground truth (the matrix True) shown in figure 4(a). A representative element can be set for each Fi to adapt the algorithm to the convolutional network; this has no impact on the prediction. The pixel closest to the center is chosen as the representative element, as seen in figure 3. Figure 4(b) illustrates the feature ground truth after the element choice.
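As an illustration, the rules above can be written as a simple lookup on the counts of each local feature. The sketch below assumes the feature counts have already been extracted into a dictionary keyed by feature index; the ordering and the "unknown" fallback are choices of this sketch:

```python
def classify_by_rules(counts):
    """Classify an input from the counts of local features F1..F10,
    following the simplified rules (unmatched combinations are 'unknown')."""
    def n(i):
        return counts.get(i, 0)

    if n(10) == 1 or (n(8) == 2 and n(9) == 2):
        return "circle"
    if (n(4) > 0 and n(6) == 0 and n(7) == 0) or \
       (n(5) > 0 and n(6) == 0 and n(7) == 0) or \
       n(2) == 1 or n(3) == 1 or n(1) == 2:
        return "line"
    if n(6) > 0 or n(7) > 0:
        return "triangle"
    if n(2) == 2 or n(3) == 2 or n(1) == 4:
        return "rectangle"
    return "unknown"
```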
The ground truths mentioned above are model-related. The input ground truth represents the model's perception of the input. The feature ground truth represents the features that influence the decision. The user ground truth is model-independent, as shown in figure 5, and is defined as the part of the input that is semantically meaningful on intuition; it includes the valid information received by users when they deal with the same input.
The process described above is how to determine the ground truth for a specific model.The following definitions are extracted.
Input ground truth The information received by the model from the input.
Feature ground truth The features that influence the model's decisions.
User ground truth The valid information received by the user.

Metric for evaluation
We propose a metric called non-overlap. It compares the ground truth with the results of the explainer and is similar in spirit to the Hamming distance [13]. Taking the task in Section 3.1 as an example, the non-overlap is the difference in pixels between the two images.
The matrices True and Exp record the ground truth and the result of the explainer, and $F$ is a thresholding function acting on them: $F(x)_{ij} = x_{ij}$ if $x_{ij} > t$ and $0$ otherwise, i.e. it retains the part above the threshold $t$. $\mathbb{1}[\cdot]$ is an indicator function. The non-overlap score is

$$S = \frac{1}{\mathrm{Area}_{\mathrm{input}}} \sum_{i,j} \left| \mathbb{1}\big[F(\mathrm{True})_{ij} > 0\big] - \mathbb{1}\big[F(\mathrm{Exp})_{ij} > 0\big] \right|,$$

where $\mathrm{Area}_{\mathrm{input}}$ is the input size and $S$ is the score, i.e. the non-overlap. A lower non-overlap means that the explanation is closer to the ground truth. The metric and the ground truth acquisition procedure together constitute our evaluation method.
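A minimal sketch of the non-overlap computation, assuming the ground truth and the explanation are given as 2-D arrays of the same size and that the threshold t is chosen by the user:

```python
import numpy as np

def non_overlap(truth, exp, t=0.0):
    """Fraction of pixels on which the thresholded ground truth and the
    thresholded explanation disagree (lower is better)."""
    truth_mask = np.asarray(truth, dtype=float) > t
    exp_mask = np.asarray(exp, dtype=float) > t
    return np.sum(truth_mask != exp_mask) / truth_mask.size
```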

Experiments
We apply our evaluation method to three image attribution explainers: Occlusion, Saliency and LRP. For these three explainers, we reconstruct traditional white-box models, similar to the one in Section 3.1, into convolutional network models. Each layer of the network corresponds to a part of the traditional algorithm, with parameters set by hand instead of by training. We can therefore consider the convolutional model a white-box, and its ground truth is also the ground truth of the traditional model.

Explainers
In the occlusion explainer, each pixel $i$ in the image is set to zero in turn. The attribution of pixel $i$ can be viewed as the difference between the original classification score $y$ and the re-obtained classification score $y'_i$. We denote the result of Occlusion as $R_1$.
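A minimal occlusion sketch, assuming `model` is a function that maps an 8*8 array to a scalar class score (the names here are illustrative, not the paper's implementation):

```python
import numpy as np

def occlusion_attribution(model, img):
    """Set each pixel to zero in turn and record the drop in the class score."""
    baseline = model(img)
    attribution = np.zeros_like(img, dtype=float)
    for r in range(img.shape[0]):
        for c in range(img.shape[1]):
            occluded = img.copy()
            occluded[r, c] = 0.0
            attribution[r, c] = baseline - model(occluded)
    return attribution
```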
The saliency explainer is related to gradients: the gradient of the classification score with respect to each pixel in the image is calculated. We denote the result of Saliency as $R_2$. The LRP explainer propagates the output back to the input layer by layer. It decomposes the attribution of position $j$ in layer $l$ into each position in layer $l-1$, then adds up all the attributions that have been decomposed to position $i$ and regards the sum as the attribution of $i$:

$$R_i^{(l-1)} = \sum_j R_{i \leftarrow j}^{(l-1,\, l)}.$$

We denote the result of LRP as $R_3$; $R_{3,i}^{(0)}$ is the attribution in the input layer, obtained after propagating through all $n$ layers.
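A saliency sketch using automatic differentiation, assuming a PyTorch `model` that returns class scores for a batched input (illustrative only; LRP requires a layer-by-layer backward pass and is omitted here):

```python
import torch

def saliency_attribution(model, img, target_class):
    """Gradient of the target class score with respect to each input pixel."""
    x = img.clone().detach().requires_grad_(True)  # img: tensor of shape (1, 1, 8, 8)
    score = model(x)[0, target_class]
    score.backward()
    return x.grad[0, 0]                            # 8*8 gradient (attribution) map
```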

Datasets
Our experiments use three datasets, two of which are artificial and one of which is natural. The artificial datasets consist of a line dataset and a geometric dataset.
Line dataset For each image in the dataset, we select a random number n and denote it as the number of lines. The starting point and direction of each line, denoted as $(x_{l_i}, y_{l_i})$ and $e_{l_i}$, are also chosen at random. We calculate the maximum length of the line based on the starting point, denote it as $LM_i$, and choose a length uniformly at random from the interval $(0, LM_i]$. Ultimately, a dataset of 1000 images of size 8*8 is obtained.
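A generation sketch for the line dataset; the direction set, the maximum number of lines per image and the rasterization are assumptions of this sketch:

```python
import numpy as np

# Assumed direction set: horizontal, vertical and the two diagonals.
DIRS = [(0, 1), (1, 0), (1, 1), (1, -1)]

def make_line_image(size=8, max_lines=3, rng=np.random.default_rng()):
    """Draw a random number of straight lines on a binary size*size grid."""
    img = np.zeros((size, size), dtype=np.uint8)
    n = rng.integers(1, max_lines + 1)              # number of lines
    for _ in range(n):
        r, c = rng.integers(0, size, size=2)        # random starting point
        dr, dc = DIRS[rng.integers(len(DIRS))]      # random direction
        lm = 0                                      # maximum length LM before leaving the grid
        while 0 <= r + lm * dr < size and 0 <= c + lm * dc < size:
            lm += 1
        length = rng.integers(1, lm + 1)            # uniform on (0, LM]
        for k in range(length):
            img[r + k * dr, c + k * dc] = 1
    return img

dataset = np.stack([make_line_image() for _ in range(1000)])  # 1000 images of size 8*8
```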
Geometric dataset There are four types of image data in it: rectangles, lines, triangles and circles. For the circle, we randomly select the circle's center and label it as $(x_c, y_c)$, calculate the maximum radius range based on the center and denote it as $LR$, and pick a random number on $(0, LR]$ as the radius length; the circle can then be generated from the center and radius. For the line, two different coordinates are chosen at random, marked as the beginning and end of the line, and the line is obtained by connecting them. For the triangle, two points are placed at random on the same horizontal line and mark the beginning and end of the triangle's base side. Let $a$, an odd number greater than 1, be the base's length and set the height of the triangle to $a-1$; this gives the vertices of the triangle, and by this method we obtain an isosceles triangle that is close to equilateral. For the rectangle, two points with distinct x and y coordinates are chosen at random as the rectangle's upper-left and lower-right vertices; the rectangle is the region enclosed by these two points.
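For example, the circle case could be rasterized as follows; the outline test and the bound used for LR are assumptions of this sketch:

```python
import numpy as np

def make_circle_image(size=8, rng=np.random.default_rng()):
    """Draw one random circle outline on a binary size*size grid."""
    xc, yc = rng.integers(1, size - 1, size=2)       # random center (x_c, y_c)
    lr = min(xc, yc, size - 1 - xc, size - 1 - yc)   # maximum radius LR inside the grid
    radius = rng.integers(1, lr + 1)                 # radius on (0, LR]
    ys, xs = np.mgrid[0:size, 0:size]
    dist = np.sqrt((xs - xc) ** 2 + (ys - yc) ** 2)
    return (np.abs(dist - radius) <= 0.5).astype(np.uint8)  # pixels near the circle line
```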
The natural dataset is the digits dataset from sklearn.datasets. It is a small handwritten-digit dataset consisting of 1797 images of size 8*8.
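Loading this dataset is straightforward:

```python
from sklearn.datasets import load_digits

digits = load_digits()
print(digits.images.shape)  # (1797, 8, 8) greyscale handwritten digits
```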

Building the white-box convolutional network
We obtain two types of transparent convolutional networks from the conventional white-box model by the methodology described in Section 3: a counting network and a classification network for the specific explainers. The following formulas are a brief expression of the computation of the convolutional networks.
$$C_i = g_i(W_{k,i} * C_{i-1} + b_i)$$
$$P_i = \mathrm{AVG}(C_i;\, W_{p,i})$$
$$FC_{soft} = g_{soft}(W_{fc} \cdot P'_n + b_{fc})$$

In the first formula, $C_i$, $W_{k,i}$, $b_i$ and $g_i$ represent the output, the kth convolutional kernel, the bias term and the activation function of the ith convolutional layer. In the second formula, $P_i$ and $W_{p,i}$ represent the output and the convolution kernel (pooling window) of the ith pooling layer, and AVG is the mean function. In the third formula, $FC_{soft}$, $W_{fc}$ and $b_{fc}$ represent the normalized output, the parameters and the bias term of the fully connected layer, $P'_n$ is the flattened output of the last pooling layer, and $g_{soft}$ is the normalization function.
The construction process of the convolutional network exactly matches the quantization process of the ground truth. Such a convolutional network does not require training; it is only a platform for evaluation. Two white-box networks are created, one for counting and one for classification.

Counting convolutional networks
The counting network searches for certain features in a given input. In particular, we look for 1*3 and 3*1 block matrices and classify the input according to whichever of these two features occurs more often. The following convolutional network is designed.
The convolution kernels W are sensitive to the location of blocks such as 1*1, 1*2, 1*3, 2*1 and 3*1. The 1*3 and 3*1 blocks are selected by the activation function g, in which the constant M is a large number. After that, an average pooling layer is set up to assist with counting, and a fully connected layer with fixed parameters performs additional counting and classification.
We obtain a white-box model since the parameters of each layer of the network are fixed and semantically meaningful. It can be made more complex by deepening the network and changing the convolutional kernels.
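As an illustration of how such a network can be assembled without training, here is a minimal PyTorch sketch; the kernel values, biases and the choice of M are assumptions of this sketch, not the exact parameters used in the paper:

```python
import torch
import torch.nn as nn

class CountingWhiteBox(nn.Module):
    """Fixed-parameter counting network: it detects 1*3 and 3*1 blocks of ones
    in a binary 8*8 input and classifies by whichever count is larger."""

    def __init__(self, M=50.0):
        super().__init__()
        self.M = M  # large constant: a steep sigmoid approximates an indicator
        self.conv_h = nn.Conv2d(1, 1, kernel_size=(1, 3))  # horizontal 1*3 detector
        self.conv_v = nn.Conv2d(1, 1, kernel_size=(3, 1))  # vertical 3*1 detector
        self.fc = nn.Linear(2, 2, bias=False)
        with torch.no_grad():
            for conv in (self.conv_h, self.conv_v):
                conv.weight.fill_(1.0)   # sum of the three covered pixels
                conv.bias.fill_(-2.5)    # positive only if all three pixels are 1
            self.fc.weight.copy_(torch.eye(2))

    def forward(self, x):                              # x: (batch, 1, 8, 8), binarized
        h = torch.sigmoid(self.M * self.conv_h(x))     # ~1 wherever a 1*3 block sits
        v = torch.sigmoid(self.M * self.conv_v(x))     # ~1 wherever a 3*1 block sits
        h_count = h.mean(dim=(2, 3))                   # average pooling acts as counting
        v_count = v.mean(dim=(2, 3))
        logits = self.fc(torch.cat([h_count, v_count], dim=1))
        return torch.softmax(logits, dim=1)            # class 0: 1*3 dominant, class 1: 3*1
```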

Classification convolutional network
The classification convolutional network classifies the input into the four geometric classes, with an algorithm similar to that of Section 3.1. The process includes feature point extraction and classification according to the rules.
Feature point extraction consists of finding the greyscale difference in several directions and adding up the squared results. This process involves three layers of convolution. The first layer has eight convolution kernels, i.e. $k_1 = 8$; with a kernel size of 3*3, this layer computes the greyscale difference in each direction, and the squaring operation is carried out by the activation function. The second and third layers of convolution implement the summation operation and filter out the regions that exceed the threshold in each direction. The second convolutional layer has four convolutional kernels, i.e. $k_2 = 4$, and the third convolutional layer has only one convolutional kernel, i.e. $k_3 = 1$.
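The eight first-layer kernels can be written down directly; each one subtracts the center pixel from one of its eight neighbours, so squaring the response gives the squared greyscale difference. This is a sketch with an assumed orientation convention:

```python
import numpy as np

# Neighbour offsets for 0°, 45°, 90°, 135°, 180°, 225°, 270°, 315°.
OFFSETS = [(0, 1), (-1, 1), (-1, 0), (-1, -1),
           (0, -1), (1, -1), (1, 0), (1, 1)]

def direction_difference_kernels():
    """Eight 3*3 kernels; kernel d computes neighbour_d - center."""
    kernels = np.zeros((8, 3, 3))
    for d, (dr, dc) in enumerate(OFFSETS):
        kernels[d, 1, 1] = -1.0           # center pixel
        kernels[d, 1 + dr, 1 + dc] = 1.0  # neighbour in direction d
    return kernels
```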
The fourth and fifth convolutional layers, together with the fully connected layer, implement the rule-based classification. The essence of a rule is the correspondence between a combination of local features and a classification. The process is split into finding specific local features, counting the number of these features and establishing the correspondence. The fourth layer of convolution finds the specific local features; as mentioned above, all of these local features can be expressed as 3*3 matrices due to their size restrictions, so convolution kernels of size 3*3 can detect them. This step is carried out similarly to Section 4.3.1. The counting is implemented in the fifth layer of convolution, again similarly to the previous subsection. The fully connected layer implements the correspondence between the combination and the classification; to classify each input into the desired class, its parameters are derived from the pre-established rules.

Results and discussions
Three datasets and two white-box networks are combined to generate three test models. To illustrate the procedure and efficacy of the evaluation approach, we select three image attribution explainers and generate their explanations on the input data. Considering that one of the three explainers is gradient-related, gradient experiments are conducted to examine the performance of the explainers before and after gradient vanishing. In addition, we supplement the evaluation by comparing the three explainers' operational efficiency under the same conditions.

Qualitative and quantitative analysis
The ground truths emphasize different perspectives. The input ground truth evaluates how well the explainer explains the model's acquisition of input information. The feature ground truth evaluates how well the explainer explains the decision attribution in the model. The user ground truth evaluates whether the explainer meets the user's intuition. Taking into account the characteristics of the feature ground truth, the explainer results for this aspect only preserve the area that is more important for the decision; we retain the complete explainer results in the evaluation of the input perspective. Using a sample image from the line dataset, we display the findings in figure 6. The figure provides a visual comparison of the three explainers on the same input of the line dataset. The first item of each row displays the original input and the second item displays the ground truth; both are available before the network is created. The last three items display the explanations of the three explainers, namely Occlusion, Saliency and LRP. The red area is the explanation and the white area is the ground truth. We can observe that the results of the occlusion explainer and the LRP explainer follow human intuition more closely and perceive the model's intake of reliable information more thoroughly. Besides this, the saliency explainer and the LRP explainer are more precise from the feature perspective.

The quantitative experimental results are shown in figure 7(a), which shows the performance of the three explainers on the three models, averaged over 100 or 1000 pieces of data. From left to right are the evaluation results of the input perspective, the feature perspective and the user perspective. Models A, B and C correspond to the counting task on the line dataset, the classification task on the geometric dataset and the classification task on the natural dataset, respectively; the results shown in figure 6 are from Model A. Regarding the input perspective, all explainers perform similarly on Model A and Model B, which is consistent with figure 6. The LRP explainer is somewhat superior, and its advantage on Model C is more obvious. This suggests that the LRP explainer captures the effective regions of the input more comprehensively, and that its advantage would grow further on more complex datasets (e.g. natural datasets). From the feature perspective, all three explainers also perform similarly on Model A. Combined with figure 6, the explanation obtained by the occlusion explainer is much larger than the ground truth, and this characteristic is even more obvious on Models B and C. As to the user perspective, the LRP explainer performs better on all three models, which is related to its algorithm: the outcome of the explainer is related to the input of each layer, including the input data, so the explanation is strictly constrained to the input region that contains non-zero values.

Gradient
Considering that a gradient explainer may be more affected by the gradient itself, some gradient-related information is supplemented in the experiments. Only the model-related evaluations are examined, that is, the input perspective and the feature perspective. The gradient is controlled by adjusting the size of the gradient-related parameters. The experiments are carried out on Model A, and M is the constant parameter of the activation function. Since the result obtained after the gradient vanishes is a blank map, the blank explanation is set as the benchmark and the score is calculated after subtracting this blank benchmark. The experimental results are shown in figure 7(b), which contains the average results of 100 pieces of data on Model A; the result of the input perspective is on the left and the result of the feature perspective is on the right. As shown in the figure, all three explainers obtain explanation results when M is less than 80. After a certain threshold (M >= 80), the gradient can no longer be calculated in the network, and the result of the saliency explainer drops to zero, indicating that its results coincide with the blank map. It can be concluded that the saliency explainer is significantly affected by the gradient and has a smaller range of applications.

Effectiveness
The experiments also evaluate the efficiency of the three explainers by comparing their running time, as shown in figure 7(c). The efficiency of the occlusion explainer is significantly poorer than that of the other two explainers, because every time a pixel is occluded, the model must be completely recomputed with the new input. The LRP explainer is slightly faster than the saliency explainer, but the difference has little practical significance.

Conclusions
In light of recent research on the evaluation of explanations, our method includes three evaluation perspectives: input, feature and user. The input perspective evaluates the explainer's ability to fully capture the valid information in the model input. The feature perspective evaluates the explainer's ability to capture the direct causes of model decisions, and the user perspective evaluates how well the explainer matches the user's intuition. We obtain the quantified ground truth corresponding to the three perspectives through the transparency of the white-box model's structure. The ground truth is independent of the explainer and can be considered the true label of the model in terms of explanation. Using it as a benchmark and combining it with the quantitative non-overlap metric, the results of different explainers can be compared. We apply the evaluation approach to three image attribution explainers: the occlusion explainer, the saliency explainer and the LRP explainer. We obtain qualitative and quantitative evaluations from the experiments, together with evaluations of the gradients and of efficiency.
Our method provides a direct, a priori evaluation with an independent benchmark that is not influenced by the results of the explainers. Additionally, we extend the evaluation to natural datasets in the experiments and obtain similar evaluations on both the artificial and natural datasets. Therefore, a relatively accurate evaluation can be obtained with reasonably complex artificial data, which significantly reduces costs. The evaluation approach can adapt flexibly to many contexts and aids users in selecting the explanations that best suit their needs.

Figure 2. The process of obtaining the input ground truth.

Figure 4. Feature ground truth, shown in (a), which is the region that plays a decisive role in the classification. Take input b as an example: it is classified as a circle according to the second rule, the classification is influenced by local features F8 and F9, and the corresponding region in M_b is marked out, i.e. the yellow region.

Figure 6. Explainer performance on the same model.