Self-attention-driven retrieval of chest CT images for COVID-19 assessment

Numerous methods have been developed for computer-aided diagnosis (CAD) of coronavirus disease-19 (COVID-19), based on chest computed tomography (CT) images. The majority of these methods are based on deep neural networks and often act as "black boxes" that cannot easily gain the trust of the medical community, while their result is uniformly influenced by all image regions. This work introduces a novel, self-attention-driven method for content-based image retrieval (CBIR) of chest CT images. The proposed method analyzes a query CT image and returns a classification result, as well as a list of classified images, ranked according to similarity with the query. Each CT image is accompanied by a heatmap, which is derived by gradient-weighted class activation mapping (Grad-CAM) and represents the contribution of lung tissue and lesions to COVID-19 pathology. Beyond visualization, Grad-CAM weights are employed in a self-attention mechanism, in order to strengthen the influence of the most COVID-19-related image regions on the retrieval result. Experiments on two publicly available datasets demonstrate that the binary classification accuracy obtained by means of DenseNet-201 is 81.3% and 96.4% for the COVID-CT and SARS-CoV-2 datasets, respectively, with a false negative rate of less than 3% in both datasets. In addition, the Grad-CAM-guided CBIR framework slightly outperforms the plain CBIR in most cases, with respect to nearest-neighbour (NN) and first-four (FF) accuracy. The proposed method could serve as a computational tool for a more transparent decision-making process that could be trusted by the medical community. In addition, the employed self-attention mechanism increases the obtained retrieval performance.


Introduction
Since the outbreak of the pandemic, a large number of methods for COVID-19 computer-aided diagnosis (CAD) have appeared, which are based on computed tomography (CT) imaging [1]. The vast majority of these methods employ deep neural networks for the classification of a CT image as positive or negative. Although the application of neural networks in such CAD applications is straightforward and widely adopted, the importance of plain classification is somewhat circumscribed by reports in the medical literature of the lower sensitivity of CT-based diagnosis, when compared with RT-PCR [2]. Another limitation of plain binary (positive-negative) classification in the context of CAD is that such an approach is not intuitive and often acts as a "black box", which cannot gain the trust of the medical community, while it hinders understanding and conceals any bias in the classification, as well as in the datasets employed.
In the context of image-based CAD, binary classification can be complemented by content-based image retrieval (CBIR). Such an approach provides radiologists with visual aid in the form of a ranked list of similar images, strengthening confidence in incorporating CAD-cued results in decision making [3]. In addition, CBIR can help physicians to conduct further statistical evaluations on groups of similar patients' profiles [4].
Another element of an image-based CAD approach comes with visualization methods, such as Occlusion [5], Grad-CAM [6] and LIME [7]. The output of these methods is often a semantic heatmap, which represents the contribution of each image region to the classification result, offering valuable insights into the decisions and performance of deep learning-based methods, as well as into the composition and shortcomings of the benchmark datasets used. With respect to COVID-19, the spatial distribution of lung tissue and lesions affected by COVID-19 pathology could aid the evaluation of disease severity and progress.
Starting from these considerations, we propose a COVID-19 CAD method for CBIR of CT images. Taking into account a preliminary experimental evaluation, the proposed method employs DenseNet-201 [8] for feature extraction and classification of each query CT image. Still, the proposed CBIR framework is generic and can easily encompass alternative classifiers. Beyond binary classification, the proposed method returns a list of classified CT images, ranked according to similarity with the query CT image. In addition, the query CT image, as well as all CT images of the ranked list, are accompanied by a heatmap, which is derived by Grad-CAM and represents the contribution of lung tissue and lesions to COVID-19 pathology. Beyond visualization, Grad-CAM weights are employed in a self-attention mechanism, in order to strengthen the influence of the most COVID-19-related image regions on the retrieval result. In qualitative terms, the proposed CBIR framework: (1) provides more complete feedback to the radiologist, when compared to the plain positive/negative feedback provided by classification methods, (2) offers visualizations via Grad-CAM heatmaps, aiding explainability, (3) is self-attention-driven, performing more targeted image retrieval, when compared to the few existing CBIR methods available for COVID-19 CAD.
The rest of this text is organized as follows: section 2 presents the related work in COVID-19 CAD and section 3 presents the proposed CBIR method. Section 4 presents the experimental evaluation on two publicly available datasets.

Related work
Almost immediately after the start of the pandemic, several works aiming at COVID-19 binary classification appeared. Zheng et al [9] designed a 3D deep convolutional neural network (CNN) for COVID-19 detection on chest CT slices. In another work, Chen et al [10] employed a pretrained model, UNet++, on high-resolution CT images for COVID-19 detection. Cifci [11] presented a method for the early diagnosis of COVID-19 with CT, using two pretrained models, AlexNet and Inception-V4, as well as transfer learning. He et al [12] proposed a method which synergistically integrates contrastive self-supervised learning with transfer learning, in order to learn descriptive and unbiased feature representations and limit overfitting. Wang et al [13] employed ImageNet-pretrained Inception-V3 [14] for COVID-19 classification of CT images. Singh et al [15] employed a CNN-based architecture tuned by multiobjective differential evolution for binary COVID-19 classification. Wang et al [16] employed a pretrained U-Net for lung region extraction in CT images, followed by COVID-19 infection probability estimation obtained by a 3D deep neural network. COVID-19 lesions are localized by combining the activation regions of the classification network with unsupervised connected components. Amyar et al [17] proposed a multitask deep learning model aiming to jointly classify and segment CT images with respect to COVID-19. Their main considerations are to leverage useful information associated with each task in a synergistic fashion, as well as to deal with limited data availability. Walvekar and Shinde [18] used ResNet-50 in order to distinguish between COVID-19 and other frequent diseases of the respiratory system. In the context of the MIA-COV19D contest of ICCV 2021, Gao et al [19] used vision transformer (ViT) methods, based on attention models and DenseNet. In the same contest, Teli [20] proposed TeliNet, a shallow architecture, which is shown to outperform VGG-16 in binary COVID-19 classification, whereas it requires fewer parameters. Jaiswal et al [21] employed transfer learning and DenseNet-201.
A number of COVID-19 classification methods address the issue of explainability, mostly by incorporating saliency heatmaps. Wang et al [26] redesigned COVID-Net [22] for CT image classification. They explicitly tackled the cross-site domain shift by separately conducting feature normalization in a latent space. In addition, they used a contrastive training objective to enhance the domain invariance of semantic embeddings and boost classification performance. Finally, they employed Grad-CAM [6] for visualization purposes. Hu et al [27] proposed a VGG-inspired architecture for COVID-19 classification and lesion detection in CT images. For lesion detection, they employed 'integrated gradients' [28] in order to obtain category-specific, pixel-wise saliency heatmaps in a weakly supervised fashion. In the same direction, Wu et al [29] proposed an explainable method for joint COVID-19 classification and segmentation (JCS) in chest CT images, accompanied by Grad-CAM-derived heatmaps.
A limited number of works recognised the potential of CBIR for COVID-19 CAD. Shakarami et al [4] proposed COV-CAD, a system for COVID-19 CAD of CT images, which encompasses a CBIR component and uses a variant of AlexNet for feature extraction. In addition, COV-CAD employs majority voting on CBIR results, aiming at CBIR-informed COVID-19 classification. Qi et al [30] used ResNet-50 for feature extraction and k-NN for classification, as well as for CBIR. Interestingly, in the latter case the nearest neighbors form the ranked list of retrieved images. Pogarell et al [31] evaluated the potential of a CBIR-based system for the differentiation of several interstitial lung diseases, including COVID-19. Their experiments demonstrate that when filtering the results of CBIR by the predominant characteristics of each disease, the accuracy in diagnosing interstitial lung diseases in CT is drastically improved, both for novices and resident physicians.
Several reviews and comparative studies have appeared in the literature, addressing COVID-19 CAD on CT images. Ardakani et al [32] proposed a method for COVID-19 detection in CT, investigating variants with ten different CNN-based architectures: AlexNet, VGG-16, VGG-19, SqueezeNet, GoogLeNet, MobileNet-v2, ResNet-18, ResNet-50, ResNet-101, and Xception. The last two architectures achieved the highest performance. Shi et al [33] reviewed the rapid response of the medical imaging community towards COVID-19, covering image acquisition, segmentation, diagnosis, and follow-up, both for X-ray and CT imaging. In their comparative study, Shah et al [34] concluded that VGG-16 obtains the highest classification performance. However, this model is rather demanding in terms of computational resources. In another comparative study, Seum et al [35] demonstrated that ResNet-18 and DenseNet-201, which are less demanding than VGG-16 in that respect, perform better in COVID-19 classification.

Methods
This section presents the theoretical background and the proposed CBIR method.

DenseNet
The main idea introduced by DenseNet was to use direct connections from any layer to all subsequent layers [8]. The l-th layer receives the feature maps of all preceding layers, x_0, …, x_{l−1}, as input:

x_l = H_l([x_0, x_1, …, x_{l−1}]),  (1)

where [x_0, x_1, …, x_{l−1}] represents the concatenation of the feature maps generated in layers 0, …, l−1. This network architecture is named dense convolutional network (DenseNet), because of its dense connectivity. For ease of implementation, the multiple inputs of H_l(·) in equation (1) are concatenated into a single tensor. In addition, H_l(·) is defined as a composite function of three consecutive operations: batch normalization, a rectified linear unit (ReLU) and a 3 × 3 convolution [8].
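The dense connectivity above amounts to channel-wise concatenation feeding every layer. A minimal NumPy sketch illustrates this; the function h_l below is only a stand-in for the real composite H_l (batch normalization, ReLU, 3 × 3 convolution), using a random channel projection for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

def h_l(x, k_out=4):
    # Stand-in for the composite function H_l (BN + ReLU + 3x3 conv);
    # here a random linear projection over channels followed by ReLU.
    w = rng.standard_normal((x.shape[-1], k_out))
    return np.maximum(x @ w, 0.0)

def dense_block(x0, num_layers=3, growth=4):
    # Dense connectivity: layer l receives the channel-wise concatenation
    # [x_0, ..., x_{l-1}] of ALL preceding feature maps.
    features = [x0]
    for _ in range(num_layers):
        x_l = h_l(np.concatenate(features, axis=-1), k_out=growth)
        features.append(x_l)
    return np.concatenate(features, axis=-1)

x0 = rng.standard_normal((8, 8, 2))   # toy H x W x C feature map
out = dense_block(x0)                 # output channels: 2 + 3 * 4 = 14
```

Note how the channel count grows linearly with depth (by the "growth rate" per layer), which is why DenseNet interleaves transition layers to keep feature maps compact.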
To facilitate down-sampling, DenseNet is divided into multiple densely connected blocks. The layers between blocks, which perform convolution and pooling, are referred to as transition layers and consist of a batch normalization layer and a 1 × 1 convolutional layer, followed by a 2 × 2 average pooling layer. Several instances of DenseNet have been studied in [8], including DenseNet-201 (with 201 layers).
In the DenseNet-201 variant of Jaiswal et al [21], the softmax activation function of the standard DenseNet-201 architecture is removed. A flattening layer is added, followed by two dense layers with 128 and 64 neurons and dropout rates of 0.2 and 0.3, respectively. The network ends with a sigmoid activation for binary classification.

Grad-CAM
The class activation mapping (CAM) method [6] modifies CNN architectures by replacing fully-connected layers with convolutional layers and global average pooling, in order to obtain class-specific feature maps. A limitation of CAM is that it dictates a specific form of CNN architecture, in which global average pooling is performed over convolutional maps prior to prediction. However, such an architecture may result in lower accuracy when compared to other architectures, as is the case with image classification, or may simply be inapplicable to other tasks, such as image captioning or visual question answering.
CAM is based on the following equations:

F_k = (1 / (HW)) ∑_{h=1}^{H} ∑_{w=1}^{W} A_{h,w,k},  (2)

S_c = ∑_{k=1}^{K} w_k^c F_k,  (3)

M_c(h, w) = ∑_{k=1}^{K} w_k^c A_{h,w,k},  (4)

where A ∈ ℝ^{H×W×K}, F ∈ ℝ^{K}, S ∈ ℝ^{C} and M_c ∈ ℝ^{H×W} are the features of the last layer, the pooled features, the class scores and the class activation maps, respectively, w_k^c are the weights of the final linear layer, H, W are the dimensions of the last layer, K is the dimension of the pooled feature vector, C is the number of classes, and h, w, k, c are the respective indices.
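The CAM quantities defined above can be sketched directly in NumPy. The function below is an illustration only: it assumes features A of shape H × W × K and a final linear layer with weight matrix w of shape K × C, as in the CAM formulation:

```python
import numpy as np

def cam(A, w):
    """CAM quantities from last-layer features A (H x W x K) and the
    weights w (K x C) of the final linear layer."""
    F = A.mean(axis=(0, 1))             # pooled features F_k (global average pooling)
    S = F @ w                           # class scores S_c
    M = np.einsum('hwk,kc->hwc', A, w)  # class activation maps M_c(h, w)
    return F, S, M
```

A useful sanity check: by linearity, the spatial mean of each map M_c equals the score S_c, so the maps decompose the class score over image locations.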
Selvaraju et al [6] introduced Grad-CAM, an alternative approach for combining feature maps, which uses the gradient signal in any CNN-based architecture. For a fully-convolutional architecture, Grad-CAM generalizes CAM.
In Grad-CAM, any layer can be used to derive activations A ∈ ℝ^{H×W×K}. The gradients of the class score S_c with respect to A are computed as:

∂S_c / ∂A_{h,w,k}.  (5)

The above gradients are globally average pooled to obtain weights a^c ∈ ℝ^{K}:

a_k^c = (1 / (HW)) ∑_{h=1}^{H} ∑_{w=1}^{W} ∂S_c / ∂A_{h,w,k}.  (6)

The activation maps M_c ∈ ℝ^{H×W} are computed according to:

M_c = ReLU( ∑_{k=1}^{K} a_k^c A_k ),  (7)

where A_k denotes the k-th activation map. A rectified linear unit (ReLU) is applied to the linear combination of maps, since only the features that have a positive influence are of interest. More details on Grad-CAM can be found in [6].
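Given a layer's activations and the gradients of the class score with respect to them (which a deep learning framework provides via backpropagation), the Grad-CAM heatmap itself is a few lines of NumPy; this sketch takes both arrays as inputs:

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap from a layer's activations A (H x W x K) and the
    gradients of the class score S_c with respect to those activations."""
    # Global average pooling of the gradients yields the channel weights a_k^c.
    weights = gradients.mean(axis=(0, 1))                         # shape (K,)
    # ReLU of the weighted sum keeps only regions with positive influence.
    return np.maximum((activations * weights).sum(axis=-1), 0.0)  # shape (H, W)
```

For visualization, the resulting H × W map is typically normalized, upsampled to the input image size and overlaid as a colour heatmap.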

CBIR pipeline
Rather than providing a plain classification result for each input image, a CBIR method searches a reference dataset and returns a list of images, ranked according to similarity. For this, a similarity measure is employed, quantifying the similarity between the query image and all images of the reference dataset. Such a measure could be the L1 or L2 (Euclidean) distance between appropriately defined feature vectors representing each image. In the context of CAD methods, CBIR provides radiologists with visual aid, justifying the produced results, whereas it strengthens confidence in incorporating CAD-cued results in decision making [3].
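The retrieval step itself reduces to a distance computation and a sort. A minimal sketch, assuming precomputed feature vectors for the query and the reference dataset, and using the L2 distance mentioned above:

```python
import numpy as np

def retrieve(query, reference, top_k=4):
    """Rank reference feature vectors by L2 (Euclidean) distance to the
    query and return the indices of the top_k most similar images."""
    dists = np.linalg.norm(reference - query, axis=1)  # one distance per image
    order = np.argsort(dists)                          # most similar first
    return order[:top_k]

# Toy usage: three 2D "feature vectors"; index 1 is closest to the query.
refs = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
ranked = retrieve(np.array([0.9, 1.1]), refs, top_k=2)  # -> indices [1, 0]
```

The returned indices are then mapped back to the reference CT images and their labels to form the ranked list shown to the physician.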
The feature vector used for retrieval is derived by:

F_k^{Grad-CAM} = (1 − λ) F_k + λ a_k F_k,  (8)

where F is the feature vector pooled from the l-th network layer, the parameter λ adjusts the relative importance of the uniform and the weighted terms, and the vector a comprises the Grad-CAM coefficients [6]:

a_k = (1 / (HW)) ∑_{h=1}^{H} ∑_{w=1}^{W} ∂S_c / ∂A_{h,w,k},  (9)

where S_c is the score assigned by the network to class c, H, W are the dimensions of the H × W image in the l-th network layer and K is the size of F. This weighting results in a feature vector F_Grad-CAM, which is more targeted to COVID-19 than the initially pooled feature vector F. This can be explained by the fact that the image regions associated with increased weights are the ones highlighted by Grad-CAM, due to their contribution to the classification result with respect to COVID-19 pathology. A somewhat similar idea has appeared in [36]; however, that method uses a Kullback-Leibler-based loss function in the context of generic visual classification. The three-fold feedback provided by the proposed CBIR method goes far beyond the usual 'black-box' binary classification result provided by most COVID-19 CAD methodologies. The Grad-CAM heatmap highlights lesions mostly related to COVID-19 pathology, such as ground-glass opacities (GGOs). The retrieved ranked list provides contextual information which can be co-evaluated by the expert physician, along with the accompanying ground-truth labels and the derived Grad-CAM heatmaps.
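The self-attention weighting can be sketched as follows. The blending of the uniformly pooled features with the Grad-CAM-weighted ones via λ is our reading of the scheme; only the Grad-CAM coefficients a_k (pooled gradients, equation (9)) follow [6] directly:

```python
import numpy as np

def grad_cam_features(activations, gradients, lam=0.5):
    """Grad-CAM-weighted pooled feature vector (a sketch; the exact
    lambda-blending of the two terms is an assumption)."""
    f = activations.mean(axis=(0, 1))   # uniformly pooled feature vector F
    a = gradients.mean(axis=(0, 1))     # Grad-CAM coefficients a_k
    return (1.0 - lam) * f + lam * a * f
```

With λ = 0 the plain CBIR feature vector F is recovered, so the two variants compared in the experiments correspond to the two extremes of the same formula.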

Experimental evaluation
In this section, we present the experimental evaluation of the proposed method. We provide details on the datasets used, the experimental configuration and the results obtained for classification and CBIR.

Datasets
The proposed method has been experimentally evaluated on two publicly available datasets of chest CT images, which include cases of COVID-19, as well as healthy cases.
The COVID-CT dataset [37] is publicly available and has been acquired from 216 patients. The dataset comprises 812 chest CT images, divided into 349 COVID-19-positive and 463 COVID-19-negative images. The COVID-19-positive images have been acquired from preprints in the medRxiv and bioRxiv repositories between January 19 and March 25, 2020.
The SARS-CoV-2 dataset [38] is publicly available as well and comprises 2482 chest CT images from 120 patients, with 1252 images acquired from 60 COVID-19-infected patients and 1230 images acquired from 60 non-infected patients. All data have been acquired from hospitals in São Paulo, Brazil.
Both publicly available datasets have been widely adopted by the medical image analysis community and are commonly used for benchmarking COVID-19 CAD methods. Although the emergence of COVID-19 variants has pathological and clinical implications, from an image processing and machine learning perspective, CT images of COVID-19 variants are still characterised by bright lesions and spots, often associated with GGOs [39]. This justifies the use of these benchmark datasets for the evaluation of recently published methods [40, 41].

Experimental configuration
The proposed method has been implemented in Python 3.9, using the Keras library with a TensorFlow backend. The implementation is available here. Classification is performed by means of the DenseNet-201 variant of Jaiswal et al [21]. The network has been pretrained on ImageNet and the extra layers described in subsection 3.1 are trained on the COVID-CT and SARS-CoV-2 datasets.

CBIR
Two CBIR variants are evaluated, employing either the feature vector F (section 3) extracted directly from DenseNet-201 with uniform weighting, or the feature vector F_Grad-CAM, weighted with the Grad-CAM coefficients (equation (9)). Table 3 presents the nearest-neighbour (NN) and the first-four (FF) accuracy of the two CBIR variants, for various subsets of both datasets. In most cases, the Grad-CAM-guided variant obtains slightly higher NN and FF. These metrics are particularly important in the context of a CAD tool, in which only the first few CT images of the ranked list are actually evaluated by the physicians. Also, it can be noted that in the case of subsets of the SARS-CoV-2 dataset, which is about three times larger than the COVID-CT dataset, the Grad-CAM-guided CBIR variant (Var. 2) consistently outperforms the uniformly weighted CBIR (Var. 1). Figure 3 presents example queries and CBIR results. The CT image query, as well as the retrieved CT images, are marked as either positive or negative, with red or green respectively, according to the ground-truth label. COVID-19-related artifacts are marked with arrows. In the first positive example (first line), peripheral GGOs can be observed in the upper and lower portions of the lungs. In the second positive example (second line), linear consolidations can be observed in the lower lobes and co-exist with GGOs in some images. It can be observed that the retrieved CT images are visually similar to the CT image queries. In almost all retrieved CT images, the label matches that of the query, with the exception of the last example, in which the query is negative, whereas the second and the fourth CT images are positive. All CT images are accompanied by their respective Grad-CAM heatmap. It can be observed that in the case of COVID-19-positive images, the heatmaps highlight bright spots and lesions, often associated with GGOs. Moreover, the highlighted regions of COVID-19-positive images tend to occupy the lung periphery. In the case of COVID-19-negative images, the heatmaps tend to be less focused on specific regions.
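The NN and FF metrics used in table 3 can be sketched as below. NN is the fraction of queries whose single closest reference image carries the query's label; for FF we assume here the mean fraction of the first four retrieved images carrying the query's label, since the text does not spell out the exact definition:

```python
import numpy as np

def nn_ff_accuracy(query_feats, query_labels, ref_feats, ref_labels):
    """Nearest-neighbour (NN) and first-four (FF) retrieval accuracy.
    FF is computed here as the mean fraction of correctly labelled images
    among the first four retrieved (an assumed definition)."""
    nn_hits, ff_frac = 0, 0.0
    for q, y in zip(query_feats, query_labels):
        # Rank reference images by L2 distance to the query features.
        order = np.argsort(np.linalg.norm(ref_feats - q, axis=1))
        nn_hits += int(ref_labels[order[0]] == y)
        ff_frac += float((ref_labels[order[:4]] == y).mean())
    n = len(query_labels)
    return nn_hits / n, ff_frac / n
```

On real data, query_feats and ref_feats would be the pooled (or Grad-CAM-weighted) DenseNet-201 feature vectors of the test and reference CT images.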

Conclusions
This work introduces a novel Grad-CAM-guided method for CBIR of CT images for COVID-19 CAD.
The proposed method employs DenseNet-201 for the classification of each CT image as COVID-19 positive or negative. Beyond classification of each CT image query, the proposed method returns a list of labelled CT images, ranked according to their similarity with the query. All images are accompanied by a heatmap, which is derived by Grad-CAM and represents the contribution of lung tissue and lesions to COVID-19 pathology. This type of feedback is clearly more informative than a simple positive/negative classification result, providing radiologists with visual aid and strengthening their confidence in incorporating CAD-cued results in their decision making. Grad-CAM is not only employed for visualization, but also as part of a self-attention mechanism, resulting in retrieval which is more targeted to COVID-19 pathology, when compared to uniformly weighted feature vectors. Overall, rather than focusing on the role of alternative CNN architectures or on incremental improvements in binary classification performance, this work introduces a novel self-attention-driven CBIR framework which aids diagnosis by means of more versatile and targeted feedback.
Experiments on two publicly available datasets lead to the following conclusions: (1) The binary classification accuracy obtained by means of DenseNet-201 is 81.3% and 96.4%, for COVID-CT and SARS-CoV-2 datasets, respectively, with a false negative rate which is less than 3% in both datasets.
(2) The Grad-CAM-guided CBIR framework slightly outperforms the plain CBIR in most cases, with respect to NN and FF (table 3).
Future perspectives of this work include more complex classification schemes, such as CBIR-guided weighted majority voting, as well as the utilisation of Grad-CAM heatmaps for CT image segmentation. Another direction for future work involves conducting a long-term study of clinical outcomes following the routine use of the proposed method in everyday hospital settings.

Figure 1 illustrates a visual summary of the proposed CBIR method. An ImageNet-pretrained DenseNet-201 CNN architecture is fine-tuned by means of a training set comprising CT images labelled as COVID-19 positive or negative. The trained model is applied on the CT images of the training set, in order to obtain a Grad-CAM heatmap for each training CT image. The selection of this particular CNN architecture is based on preliminary experiments performed in the context of this work (see section 4). It should be noted that the proposed CBIR framework is generic and may encompass alternative CNN architectures. The trained model is applied on each new CT image query and the provided feedback is three-fold (marked with yellow in figure 1): (1) a binary classification result, (2) a Grad-CAM heatmap indicating the most important lung lesions with respect to COVID-19, (3) a ranked list of labelled CT images, according to their similarity with the query CT image. In figure 1 the label is marked with red for COVID-19 positive and green for COVID-19 negative. The feature vectors F_Grad-CAM used for CBIR are derived by weighting F with the Grad-CAM coefficients (equation (9)).

Figure 1. Visual summary of the proposed CBIR framework.

Figure 2 illustrates the training and validation accuracy per training epoch, for both datasets. The small fluctuations in accuracy are a side-effect of the Adam optimizer employed. It should be stressed that the focus of this work is not classification itself; rather, classification should be viewed as a component of the proposed CBIR framework.

Figure 2. Training accuracy per epoch, when training on the COVID-CT and SARS-CoV-2 datasets (training accuracy in blue, validation accuracy in orange).

Figure 3. Example CBIR results. Each CT image query appears at the left. COVID-19-related artifacts are marked with arrows. All CT images are accompanied by their respective Grad-CAM heatmaps (below). Positive and negative CT images are marked with red and green, respectively.

Table 1. Classification accuracy for various neural network configurations.

Table 2. Confusion matrix for DenseNet-201-based classification on the COVID-CT and SARS-CoV-2 datasets.

In the case of the COVID-CT dataset, 662 samples are used for training and 150 samples are used for testing. In the case of the SARS-CoV-2 dataset, 1985 samples are used for training and 497 samples are used for testing. Table 1 presents the classification results obtained for three neural network configurations. It can be observed that the DenseNet-201-based configuration achieves the highest overall classification accuracy, equal to 81.3% and 96.4% on the COVID-CT and SARS-CoV-2 datasets, respectively. Based on this result, this configuration is employed in the experiments that follow. Table 2 presents the confusion matrices of the classification results obtained by DenseNet-201 for both datasets. It can be noted that false negative predictions are not frequent (less than 3% in both datasets). Overall, the vast majority of predictions lies on the diagonal of the confusion matrix.