Design and Implementation of an Attention-Based Interpretable Model for Document Classification

With the rapid development of artificial intelligence and machine learning, deep learning technology has been successfully applied to many tasks in natural language processing. However, the erroneous decisions that deep learning models occasionally produce, together with their black-box nature, cause users to doubt their decisions. Models urgently need the ability to give rationalized explanations of their decisions, that is, interpretability. In this paper, we propose an attention-based interpretable model for document classification, which aims to achieve strong classification performance and good interpretability at the same time. Experiments show that our classification model is close in performance to other mainstream models, and that the attribution-score explanations provided by the interpretation method are effective.


Introduction
In recent years, with the rapid development of artificial intelligence and machine learning, deep learning [1] has been widely used in computer vision [2][3], natural language processing [4], speech recognition [5], and other fields. Deep neural network models not only surpass traditional machine learning models in many tasks, but have become the best models available, even achieving results comparable to humans in some real-world tasks [6]. Nowadays, people enjoy the convenience of deep learning technology through intelligent devices, and deep learning also plays an important role in assisting human decision-making in many fields.
Compared with traditional machine learning methods, deep neural networks have achieved great success because of their ability to automatically extract features from data and carry out end-to-end learning, without requiring researchers to manually construct features. However, there is an unavoidable problem: deep learning models may make unexpected errors. Researchers found that even slight changes to a model's input image can lead a deep neural network to a completely irrelevant prediction; in most experiments on adversarial examples, deep neural network models show poor robustness and are easily deceived. Moreover, most deep neural network models have millions of parameters to learn, and this huge number of parameters makes the models essentially impossible to understand.
In this paper, we focus on the interpretability of attention-based models for the text classification task and propose an interpretable text classification model based on the attention mechanism, which has good classification performance and strong interpretability. Specifically, we first introduce the self-attention mechanism into a bidirectional long short-term memory network, which increases the model's ability to model the relationships between words. Second, we introduce an attribution score to compute the importance of the input to the model's decision.
This paper uses extensive experiments and various evaluation metrics to verify the effectiveness of the proposed model. We also use the attribution scores to explain the decisions that the model makes.

Related Work
In recent years, researchers have proposed many methods to improve the interpretability of deep learning models. The most direct interpretation method is to add an explanation structure into the model, which is also called active interpretability intervention [7]. This approach designs the model architecture or modifies the training method during model training. Bahdanau et al. added an attention mechanism to the seq2seq framework to improve the performance of neural machine translation, and also regarded the attention weights as a soft alignment in translation [8]. The counterpart of active intervention is the post-hoc interpretation method, which is now the mainstream. It is suitable for explaining models with large numbers of parameters and complex network structures. The most classic example is Local Interpretable Model-Agnostic Explanations (LIME) [9]. Some post-hoc interpretation methods also use visualization, such as Class Activation Mapping (CAM) [10], Grad-CAM [11], and Grad-CAM++ [12].
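To make the post-hoc idea concrete, the following is a minimal NumPy sketch of a LIME-style explanation (an illustration of the general technique, not the reference implementation of [9]): word presence is perturbed with binary masks, a black-box scorer is queried, and a proximity-weighted ridge surrogate is fit whose coefficients serve as word-importance scores. The scorer `f` and all hyperparameters here are toy assumptions.

```python
import numpy as np

def lime_explain(f, x, n_samples=500, kernel_width=0.75, ridge=1e-3, seed=0):
    """LIME-style local explanation: perturb word presence with binary masks,
    fit a proximity-weighted linear surrogate, return per-word weights."""
    rng = np.random.default_rng(seed)
    n = len(x)
    Z = rng.integers(0, 2, size=(n_samples, n)).astype(float)  # binary masks
    Z[0] = 1.0                                 # include the original instance
    y = np.array([f(x * z) for z in Z])        # black-box predictions
    dist = 1.0 - Z.mean(axis=1)                # fraction of words removed
    w = np.exp(-(dist ** 2) / kernel_width ** 2)  # proximity kernel weights
    # weighted ridge regression: (Z^T W Z + lambda I) beta = Z^T W y
    ZW = Z.T * w
    beta = np.linalg.solve(ZW @ Z + ridge * np.eye(n), ZW @ y)
    return beta

# toy black box whose score depends mainly on words 0 and 3
f = lambda v: 2.0 * v[0] + 3.0 * v[3]
x = np.ones(4)                                 # all four "words" present
weights = lime_explain(f, x)                   # per-word importance scores
```

The surrogate's largest coefficients fall on the words the black box actually uses, which is the essence of a model-agnostic explanation.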
According to whether the interpretation method depends on the model, interpretation methods can be divided into model-specific and model-agnostic methods. A model-specific interpretation method is limited to a particular model family, so that family must be used for the task to obtain the explanation; for example, to obtain a decision-tree explanation, one can only use tree models for the task. Model-agnostic methods, in contrast, treat the prediction and the interpretation as two independent parts; such methods generally use post-hoc interpretation to explain the model.

The Attention-Based Interpretable Model
In this paper, we propose an attention-based interpretable model. We divide our proposed method into two parts: the classification module and the interpretation module. The classification module takes the input x and gives the classification result y. After that, the interpretation module gives an explanation of the model's decision. In this section, we introduce the architecture of the model.

Classification module
As illustrated in Figure 1, based on Zhang's model [13], this paper proposes a self-attention-based bidirectional long short-term memory model (SA-BiLSTM).
The model contains five layers. The first layer, the input layer, takes the sentence as the input $x$ and passes it to the word embedding layer. The word embedding layer uses pre-trained word embeddings to map each word into a fixed-size word vector:

$x = \{w_1, w_2, \dots, w_n\}$

where $n$ refers to the number of words in the sentence and $w_i$ refers to the vector of word $i$. The LSTM layer takes $w_t$ as input and computes the hidden state at time $t$. We use a bidirectional LSTM, so we obtain the forward hidden state $\overrightarrow{h}_t$ and the backward hidden state $\overleftarrow{h}_t$ at time $t$:

$\overrightarrow{h}_t = \overrightarrow{\mathrm{LSTM}}(w_t), \quad t \in [1, n]$

$\overleftarrow{h}_t = \overleftarrow{\mathrm{LSTM}}(w_t), \quad t \in [n, 1]$

(Figure 1: Architecture of SA-BiLSTM.)

We use an element-wise sum to combine the forward hidden state $\overrightarrow{h}_t$ and the backward hidden state $\overleftarrow{h}_t$:

$h_t = \overrightarrow{h}_t \oplus \overleftarrow{h}_t$

Inspired by the self-attention used in the Transformer [14], we use the self-attention mechanism to compute the attention weights in the self-attention layer. The self-attention layer first concatenates all outputs of the LSTM layer into a matrix $H$, and then multiplies $H$ by the trainable matrices $W_Q$, $W_K$, and $W_V$:

$H = [h_1 : h_2 : \dots : h_n]$

$Q = H W_Q, \quad K = H W_K, \quad V = H W_V$

It then uses $Q$, $K$, and $V$ to compute the self-attention layer output $H^*$:

$H^* = \mathrm{softmax}\!\left(Q K^{\top} / \sqrt{d_k}\right) V$

The output layer uses an MLP to compute the model decision $y$ from $H^*$:

$y = \mathrm{softmax}(W H^* + b)$

where $W$ and $b$ are trainable parameters.
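The self-attention step over the BiLSTM hidden states can be sketched in NumPy as follows. This is an illustrative sketch, not the paper's implementation; the matrix names, sizes, and random initialization are assumptions.

```python
import numpy as np

def self_attention(H, Wq, Wk, Wv):
    """Scaled dot-product self-attention over hidden states H (n x d)."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    dk = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(dk)                # (n, n) word-to-word scores
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)            # attention weights, rows sum to 1
    return A @ V, A

rng = np.random.default_rng(0)
n, d = 5, 8                                 # 5 words, hidden size 8 (toy values)
H = rng.standard_normal((n, d))             # stand-in for summed BiLSTM states
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
H_star, A = self_attention(H, Wq, Wk, Wv)   # H*: attended states, A: weights
```

The attention matrix `A` is exactly what the interpretation module later attributes over: entry (i, j) scores how strongly word i attends to word j.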

Interpretation module
In the interpretation module, we interpret the model by computing an attribution score. The attribution score measures the importance of the input features to the model output. Inspired by ATTATTR proposed by Hao et al. [15], we propose an attention and integrated-gradient mixed attribution method (ATTIG), which can not only measure the importance of the input words but also measure the strength of the relationships between words. Given the input $x$, the model output $F(\cdot)$, and the attention weight matrix $A$ computed by the model, ATTIG can be described as follows:

$\mathrm{ATTIG}(A) = A \odot \int_{0}^{1} \frac{\partial F(\alpha A)}{\partial A} \, d\alpha$

The attribution score can be efficiently computed via a Riemann approximation of the integral [16]. Specifically, we use an $m$-step approximation from the zero attention matrix to the original attention weights $A$:

$\mathrm{ATTIG}(A) \approx \frac{A}{m} \odot \sum_{k=1}^{m} \frac{\partial F\!\left(\frac{k}{m} A\right)}{\partial A}$

Finally, we take the column-wise maximum of the ATTIG scores as the attribution score of the input $x$:

$s_i = \max_{j} \, \mathrm{ATTIG}(A)_{j,i}$
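The m-step Riemann approximation above can be sketched as follows, using a toy differentiable scalar model and finite-difference gradients in place of a real network's backpropagated gradients. All names, the toy model, and the step count are assumptions for illustration.

```python
import numpy as np

def numeric_grad(F, A, eps=1e-5):
    """Central finite-difference gradient of scalar F w.r.t. matrix A."""
    g = np.zeros_like(A)
    for idx in np.ndindex(A.shape):
        Ap, Am = A.copy(), A.copy()
        Ap[idx] += eps
        Am[idx] -= eps
        g[idx] = (F(Ap) - F(Am)) / (2 * eps)
    return g

def attig(F, A, m=50):
    """Riemann approximation: ATTIG(A) = (A/m) * sum_k dF(kA/m)/dA."""
    grad_sum = np.zeros_like(A)
    for k in range(1, m + 1):                 # path from zero matrix up to A
        grad_sum += numeric_grad(F, (k / m) * A)
    return (A / m) * grad_sum

rng = np.random.default_rng(1)
n = 4
A = rng.random((n, n))
A /= A.sum(axis=1, keepdims=True)             # toy attention weights
M = rng.standard_normal((n, n))               # toy model parameters
F = lambda a: np.tanh((a * M).sum())          # toy scalar model output
scores = attig(F, A)                          # per-edge attribution matrix
word_attr = scores.max(axis=0)                # column-wise max as per-word score
```

Because the path starts at the zero matrix (where the toy `F` outputs 0), the attributions approximately sum to the model output, the completeness property of integrated gradients [16].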

Experimental Result and Analysis
In this paper, we aim to verify the performance of the classification model and the effectiveness of the interpretation method. We use a classification task to verify the performance of the classification model and an interpretability task to verify the effectiveness of the interpretation method.

Datasets
We use Stanford Sentiment Treebank v2 (SST2) and THUCNews as our datasets. The SST2 dataset contains two categories, labeling each sentence as positive or negative sentiment. THUCNews is generated by filtering the historical data of the Sina News RSS subscription channel from 2005 to 2011 and contains ten categories.

Compared methods and evaluate metrics
For the classification task, we use CNN [17], LSTM [18], and the Transformer [14] as compared methods, and we use precision, recall, and F1 score as our evaluation metrics. For the interpretability task, we use attention, GradientX [19], DeepLIFT [20], and integrated gradients [16] as compared methods, and we measure the difference in precision, recall, and F1 score before and after masking the word with the maximum attribution score.
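The masking-based evaluation can be illustrated with a small helper that removes the word with the highest attribution score before re-scoring the sentence. This is a hypothetical helper for illustration; the mask token and example scores are assumptions.

```python
def mask_top_token(tokens, scores, mask="[MASK]"):
    """Replace the token with the highest attribution score by a mask symbol.

    A large drop in the classifier's metrics after masking indicates that
    the attribution method found a word the model truly relies on.
    """
    top = max(range(len(tokens)), key=lambda i: scores[i])
    return [mask if i == top else t for i, t in enumerate(tokens)]

tokens = ["the", "movie", "was", "wonderful"]
scores = [0.05, 0.10, 0.05, 0.80]   # hypothetical per-word attributions
masked = mask_top_token(tokens, scores)
```

One would then re-run the classifier on `masked` and compare precision, recall, and F1 against the unmasked input.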

Experimental result
The performance comparison of the different methods on the classification task on the SST2 and THUCNews datasets is demonstrated in Table 1. As shown in Table 2, our proposed method ATTIG outperforms the other methods on both the SST2 and THUCNews datasets.

Conclusion
Despite their outstanding predictive performance, the lack of interpretability limits the use of deep learning models in many realistic scenarios. In this paper, we propose an attention-based interpretable model. We divide the model into two modules: the classification module and the interpretation module. The classification module takes the input x and gives the classification result y; the interpretation module gives an explanation of the model's decision. We design various experiments to show that our model has both good performance and good interpretability.