The effects of topological features on convolutional neural networks—an explanatory analysis via Grad-CAM

Topological data analysis (TDA) characterizes the global structure of data based on topological invariants such as persistent homology, whereas convolutional neural networks (CNNs) are capable of characterizing local features in the global structure of the data. In contrast, a combined model of TDA and CNN, a family of multimodal networks, simultaneously takes the image and the corresponding topological features as the input to the network for classification, thereby significantly improving the performance of a single CNN. This innovative approach has been recently successful in various applications. However, there is a lack of explanation regarding how and why topological signatures, when combined with a CNN, improve discriminative power. In this paper, we use persistent homology to compute topological features and subsequently demonstrate both qualitatively and quantitatively the effects of topological signatures on a CNN model, for which the Grad-CAM analysis of multimodal networks and topological inverse image map are proposed and appropriately utilized. For experimental validation, we utilize two famous datasets: the transient versus bogus image dataset and the HAM10000 dataset. Using Grad-CAM analysis of multimodal networks, we demonstrate that topological features enforce the image network of a CNN to focus more on significant and meaningful regions across images rather than task-irrelevant artifacts such as background noise and texture.


Introduction
Convolutional neural networks (CNNs) have proven their ability to extract complex patterns from images for image classification tasks [1][2][3][4][5][6][7]. However, it is well known that CNNs can be biased towards task-irrelevant patterns such as background noise and texture rather than towards the global shape of an object of interest in a given image [8,9]. This comes from the fact that CNNs extract local features with their limited receptive field and the max pooling operations that tend to preserve high-frequency information such as noise and texture.
Topological data analysis (TDA) is a recent mathematical development that characterizes the global structure of data based on computational topology. A persistence diagram [10] represents data as a multiset of topological properties obtained from persistent homology [11]. Because it is a multiset, it is challenging to feed a persistence diagram directly into a neural network. To resolve this difficulty, several vectorization methods that represent a persistence diagram as a fixed-size vector have been proposed, such as the persistence image (PI) [12], persistence landscape (PL) [13], and Betti sequence (BS) [14].
With such vectorization methods, several studies have proposed neural network architectures that take topological features as inputs. In [14], a BS was used as an input for a one-dimensional CNN model to solve time-series classification problems. In [15], a concatenated vector of the raw signal and the corresponding persistence vector was fed into a CNN for gravitational wave detection problems. These earlier studies, however, used a unimodal network that accepts only input from one modality (or channel).
Recent studies have shown that multimodal networks, which take original data and the corresponding topological features as inputs, achieve better performance than the original unimodal network without the topology network. In [16], PersLay, a deep learning layer, was proposed, which takes a raw persistence diagram as an input and combines it with a graph feature to solve several graph classification problems. TDA-Net [17], a generic network architecture that combines a two-dimensional CNN and various types of representations for persistence diagram, improves classification performance by taking an image and the corresponding BS simultaneously. It was applied successfully to COVID-19 classification from CXR-images [17].
Despite its successful performance, the multimodal network of combined TDA and CNN lacks an explanation as to how and why topological features enhance the discriminative power of the original CNN. However, it is crucial to understand the effects of topological features on a neural network to effectively integrate TDA with a deep learning framework. To analyze the effects of topological features on a CNN, we, in this paper, demonstrate how the behavior of a CNN changes when topological features are fed into the CNN. For this, we use the method of visualizing the saliency maps of the input image and its topological features. A saliency map of an input image shows the degree of influence of each pixel in the image on model prediction [18]. There have been several approaches to generating a saliency map, which can be categorized into the following two approaches.
The first approach, namely, the input-perturbation-based approach [19][20][21], perturbs small regions of inputs and measures the changes in the model prediction. If the perturbed region causes a considerable change in the prediction, it is regarded as a significant feature of the model decision. This approach deforms the given image by randomly masking the image regions with small patches. However, the perturbation method is inadequate for analyzing the combined topology net, i.e. CNN-TDA Net for our research. This is because such a deformation in an image could easily lead to a large change in topology, particularly a change in persistent homology used in our research. For persistent homology, the n-dimensional hole structure, such as the number and the size of holes, could be easily changed due to the deformation. Thus, it is difficult to determine whether the resulting change by perturbation reflects the actual significance of the region or is simply due to artifacts.
The second approach, namely, the gradient-based approach, is free of such issues, and we choose this method for our analysis. Gradient-based approaches, [22][23][24][25][26], contrary to the perturbation method, measure the importance of each region of an image by computing the gradient of model prediction with respect to features. The gradient-based method is more desirable for analyzing the influence of topological features because no corruption of the input image is used, and the individual effects of the visual (CNN) and topological features (TDA-Net) can be measured by computing the partial derivatives of the model prediction with respect to both features independently. Furthermore, it can easily be extended to multimodal networks.
One difficulty with the saliency approach is determining whether a saliency map depends on the model parameters, so that it can truly capture the regions affecting the model prediction. In [27], two methods were proposed for determining the sanity of gradient-based saliency maps, and Grad-CAM [25] passed the sanity checks. Thus, saliency methods based on Grad-CAM are suitable for comparing the differences between two networks. There have been advanced variants of Grad-CAM [26,28]; however, in this paper, we use the original Grad-CAM to focus on showing how the behavior of a CNN changes when additional topological information is given.
In this paper, we are concerned with the question: 'How does the topology network improve model performance when combined with a CNN?'. To answer this question, we first propose a methodology to utilize Grad-CAM analysis for multimodal networks and visually demonstrate the effects of topological signatures on a CNN. Specifically, we apply multimodal Grad-CAM to CNN-TDA Net [17] trained on various image classification datasets. Second, we propose a methodology to transform a feature map obtained from the considered BS and Grad-CAM into the region of the input image. We refer to this transformation as the topological inverse image map. In general, the topological inverse map is not unique; however, in our case, it is well-defined because there is a one-to-one correspondence between the constructed complex and the filtration, as explained later in this paper. By visualizing the top five individual feature maps of large channel importance, we show that a single CNN produces redundant and meaningless feature maps, whereas a CNN-TDA Net significantly reduces the number of redundant and meaningless feature maps and creates task-relevant feature maps. The results show that topological features from the topology network lead the image network to focus more on meaningful regions across all images, whereas a single CNN is biased toward task-irrelevant artifacts such as background noise and texture.
The rest of this paper is structured as follows. In section 2, we provide an overview of TDA based on persistent homology. Building cubical complexes is a crucial step for persistent homology on images for the combined CNN and TDA multimodal network towards calculating BS. BS is the main vectorization used in this paper. Thus, the building strategies are explained and the definition of BS is provided in this section. The TDA-Net, a multimodal network of combined CNN and TDA, and Grad-CAM, the main analysis tool for the visualization of feature maps, are explained in section 3. In section 4, we explain our main analysis tools for demonstrating the effects of topological features on CNNs, that is, Grad-CAM analysis on multimodal networks and the topological inverse image map. The numerical experiments using two well-known datasets, i.e. the transient versus bogus image dataset and the HAM10000 dataset, are described in section 5. The numerical results are presented in section 6. In that section, the effects of topological features on a CNN are visualized through Grad-CAM and the topological inverse image map. In addition, for quantitative analysis, the Intersection over Union (IoU) scores of both the CNN and CNN-TDA Net are computed and compared. Finally, in section 7, we provide a brief conclusion and discuss future work.

TDA on images and Betti sequences
TDA is a technique for characterizing the global structure of given data by using computational topology. Persistent homology is one of the main methodologies of TDA. With this method, the topological features of the given data are extracted as follows. First, we construct a filtration, a nested sequence of simplicial (or cubical) complexes indexed by filtration values. Second, we compute persistent homology for the filtered complex and represent it as a persistence diagram. The persistence diagram is the multiset of the birth-death pairs of each kth homology class. Due to its multiset nature, the persistence diagram is unsuitable for use as an input in a neural network. Thus, the persistence diagram is converted into a fixed-size representation using vectorization methods, such as PI [12], PL [13], and BS [14]. In this section, we briefly provide an overview of each component of the procedure above.

Cubical complex and persistent homology
Images consist of discrete pixels (or voxels) on a fixed uniform grid in Euclidean space $\mathbb{R}^d$. Intuitively, these pixels have some kind of connection structure. This connection structure may form an edge or a surface of the image. With the aid of computational topology, this connectivity can be handled suitably and calculated numerically.
Graphs or simplicial complexes are typical examples, which represent the connection between objects such as edges or planes. Cubical complexes are generally used to represent the connectivity in images. Since images are defined on a mesh, cubical complexes, whose bases are elementary cubes, are suitable to represent the topological features of the given image data.
A cubical complex is a collection of vertices, edges, squares, and higher dimensional analogues on a regular grid in $\mathbb{R}^d$. The formal definition is as follows:
Definition 1 (Cubical complex). An elementary cube $Q \subset \mathbb{R}^d$ is defined as a product $Q = I_1 \times \cdots \times I_d$ where each $I_j$ is either a singleton set $\{m\}$ or a unit-length interval $[m, m + 1]$ for some integer $m \in \mathbb{Z}$. The number $k$ of unit-length intervals in the product is called the dimension of the cube $Q$, and we call $Q$ a k-cube. If $Q$ and $Q'$ are two cubes and $Q \subseteq Q'$, then $Q$ is said to be a face of $Q'$. A cubical complex $X$ in $\mathbb{R}^d$ is a collection of k-cubes ($0 \leqslant k \leqslant d$) such that (i) every face of a cube in $X$ is also in $X$; (ii) the intersection of any two cubes of $X$ is either empty or a face of each of them.
In order to obtain a cubical complex of a given image, choose an appropriate threshold (grayscale value) and obtain a binary image. Those pixels (e.g. figure 1) assigned the value of 1 in the binary image on the grid in R 2 are regarded as a point (vertex). Now one can define a cubical complex on the point set. See figure 1.
Representing an image as a cubical complex allows us to observe the topological features in the image such as the connected components and loop structures. Moreover, these topological features in the image are characterized by the notion of homology. Hence, one may find the number of connected components or loops in the image by calculating its homology. Homology of an image is obtained through the following steps.
(i) Choose a threshold and define a cubical complex $X$.
(ii) For each $k = 0, 1, 2$, define a free abelian group (or vector space over the finite field $\mathbb{F}_2$) $C_k(X)$ whose basis is the set of all k-cubes in $X$. Each $C_k(X)$ is called the kth chain group or chain module. Let $C_{-1}(X) = 0$.
(iii) For each $k$, define a homomorphism $\partial_k : C_k(X) \to C_{k-1}(X)$ which maps each k-cube to the sum of its faces in $C_{k-1}(X)$. Each $\partial_k$ is called the kth boundary operator.
(iv) Now, the kth homology group $H_k(X)$ is defined as the quotient space $\ker \partial_k / \operatorname{im} \partial_{k+1}$ for each $k$.
Notice that the composition $\partial_k \circ \partial_{k+1}$ is the zero map. It follows that $\operatorname{im} \partial_{k+1} \subseteq \ker \partial_k$, and hence the homology group $H_k(X)$ is well-defined.
The rank of the kth homology group H k (X) is called the kth Betti number or generator. Roughly speaking, the Betti number is the number of k-dimensional holes in X such as connected components (zero-dimensional holes), loops (one-dimensional holes) and so on.
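The homology steps above reduce to rank computations on boundary matrices. The following is a minimal sketch, not the authors' implementation. It assumes that pixels with value 1 are vertices, that horizontally or vertically adjacent vertices are joined by an edge, and that a unit square is included whenever all four of its corners are present. The helper names `rank_gf2` and `betti_numbers` are hypothetical. Working over $\mathbb{F}_2$, it uses $\beta_0 = \dim C_0 - \operatorname{rank} \partial_1$ and $\beta_1 = \dim C_1 - \operatorname{rank} \partial_1 - \operatorname{rank} \partial_2$.

```python
import numpy as np


def rank_gf2(matrix):
    """Rank of a 0/1 matrix over F_2, by Gaussian elimination with XOR."""
    m = np.array(matrix, dtype=np.uint8) % 2
    rows, cols = m.shape
    rank = 0
    for c in range(cols):
        if rank == rows:
            break
        # Find a pivot row for column c among the not-yet-used rows.
        pivot = next((r for r in range(rank, rows) if m[r, c]), None)
        if pivot is None:
            continue
        m[[rank, pivot]] = m[[pivot, rank]]
        # Clear column c in every other row.
        for r in range(rows):
            if r != rank and m[r, c]:
                m[r] ^= m[rank]
        rank += 1
    return rank


def betti_numbers(binary):
    """Return (beta_0, beta_1) of the cubical complex built on the 1-pixels."""
    binary = np.asarray(binary, dtype=bool)
    verts = {(int(i), int(j)): n for n, (i, j) in enumerate(np.argwhere(binary))}
    # Edges join each vertex to its right and lower neighbour, if present.
    edges = [((i, j), (i + di, j + dj))
             for (i, j) in verts
             for di, dj in ((0, 1), (1, 0))
             if (i + di, j + dj) in verts]
    edge_index = {e: n for n, e in enumerate(edges)}
    # A square is identified by its top-left corner.
    squares = [(i, j) for (i, j) in verts
               if all((i + a, j + b) in verts for a in (0, 1) for b in (0, 1))]

    d1 = np.zeros((len(verts), len(edges)), dtype=np.uint8)
    for n, (u, v) in enumerate(edges):
        d1[verts[u], n] = 1
        d1[verts[v], n] = 1

    d2 = np.zeros((len(edges), len(squares)), dtype=np.uint8)
    for n, (i, j) in enumerate(squares):
        boundary = (((i, j), (i, j + 1)), ((i, j), (i + 1, j)),
                    ((i, j + 1), (i + 1, j + 1)), ((i + 1, j), (i + 1, j + 1)))
        for e in boundary:
            d2[edge_index[e], n] = 1

    r1, r2 = rank_gf2(d1), rank_gf2(d2)
    return len(verts) - r1, len(edges) - r1 - r2


ring = [[1, 1, 1],
        [1, 0, 1],
        [1, 1, 1]]
print(betti_numbers(ring))  # (1, 1): one connected component and one loop
```

The ring example has eight vertices, eight edges, and no squares, so the single loop around the missing center pixel survives as a one-dimensional hole.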
Using homology, useful topological information can be calculated through matrix computations. For this, we need to select appropriate thresholds to define a cubical complex. Persistent homology, a major tool of TDA, is a generalized concept of homology that helps us to learn how homology changes with increasing thresholds. The topological features used in a CNN-TDA Net considered in this paper are calculated based on persistent homology.
A grayscale image is represented by a function $I : \Omega \subseteq \mathbb{R}^2 \to \mathbb{R}$ where $\Omega$ is the grid on which the image is defined and $I(v)$ is the grayscale value of the pixel $v \in \Omega$. In general, the grayscale value of one pixel takes an integer value between 0 and 255. Additionally, let $I_{-\infty}$ be an empty set. A sublevel set of a grayscale image for a value $0 \leqslant i \leqslant 255$ is a set of all pixels in the grid with a pixel intensity less than or equal to $i$, i.e. $I_i := \{v \in \Omega : I(v) \leqslant i\}$. A sublevel set may be realized with a binary image in which 1 is assigned to pixels of the sublevel set and 0 is assigned to pixels on the remaining grid. We note here that we mainly consider the grayscale images in this paper and the colored images will be considered in our future research.
Let $X_i$ be the cubical complex defined on the binary image of the sublevel set $I_i$. Then a sequence $\emptyset = I_{-\infty} \subseteq I_0 \subseteq \cdots \subseteq I_{255}$ of the sublevel sets induces a nested sequence of cubical complexes $\emptyset = X_{-\infty} \subseteq X_0 \subseteq \cdots \subseteq X_{255} = X$. This sequence is called the filtration.
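The sublevel-set construction can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function names `sublevel_set` and `filtration` and the small test image are hypothetical, and the binary images it returns are the inputs on which the cubical complexes $X_i$ would be built.

```python
import numpy as np


def sublevel_set(image, i):
    """Binary image of the sublevel set I_i = {v : I(v) <= i}."""
    return (np.asarray(image) <= i).astype(np.uint8)


def filtration(image, thresholds=range(256)):
    """List of binary images, one per threshold, in increasing order."""
    return [sublevel_set(image, t) for t in thresholds]


# Hypothetical 3 x 3 image with intensities in {0, 1, 4}.
img = np.array([[0, 1, 4],
                [1, 0, 4],
                [4, 4, 0]])
levels = filtration(img, thresholds=range(5))

# Each binary image is contained in the next one, as a filtration requires.
nested = all((levels[t] <= levels[t + 1]).all() for t in range(len(levels) - 1))
print(nested)
```

Because the thresholds increase, a pixel that enters a sublevel set never leaves it, which is what makes the induced sequence of complexes nested.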
The inclusion relation of cubical complexes in the filtration is naturally extended to a relation between the corresponding homology groups. More precisely, this idea is based on the functoriality that appears throughout various mathematical applications. That is, the inclusion map $\iota_{i,j} : X_i \hookrightarrow X_j$ for $i \leqslant j$ induces a linear map $f^k_{i,j} : H_k(X_i) \to H_k(X_j)$ between the homology groups. Now we can define the persistence module and persistent homology:
Definition 2 (The kth persistence module). For the given filtration $\emptyset = X_{-\infty} \subseteq X_0 \subseteq \cdots \subseteq X_n = X$ and fixed dimension $k$, the kth persistence module $PM_k$ consists of the pair of the k-dimensional homology groups (or vector spaces) and linear maps
$$PM_k = \left( \{H_k(X_i)\}_{i}, \; \{f^k_{i,j}\}_{i \leqslant j} \right),$$
where $H_k(X_i)$ is the kth homology for each $X_i$ and $f^k_{i,j} : H_k(X_i) \to H_k(X_j)$ is the linear map induced by the inclusion $X_i \subseteq X_j$.
The functoriality of persistent homology plays a critical role in analyzing how the space evolves as the parameter varies. One can observe the overall homology as the parameter changes.
Recall that the generators of each homology represent the connectivity of space, such as connected components, loops, voids and so forth. Then for any i and j with i ⩽ j, the generator of the previous homology group H k (X i ) may be maintained as the generator of the latter H k (X j ) under the linear map f k i,j or no longer be a generator as expressed by the linear combination of the generators of H k (X j ). H k (X j ) may also contain new generators that were not present in H k (X i ). This means that information about the connectivity of each space is 'newly born' , 'persist' , or 'merged away' as the scale changes.
The decomposition theorem of persistence modules tells us that persistent homology can be decomposed algorithmically into a multiset of intervals $\{[b_n, d_n]\}_n$ in which each interval $[b, d]$ corresponds to a topological invariant that is born at $b$ and dies at $d$ [29]. The collection of intervals $\{[b_n, d_n]\}_n$ for persistent homology is called the persistence barcode. The persistence diagram is a collection of points in $\mathbb{R}^2$ in which each point $(b_n, d_n)$ corresponds to the interval $[b_n, d_n]$ in the barcode. The persistence barcode and the persistence diagram carry exactly the same information.
Figure 2 shows the persistent homology of the left image in (a): (a) shows the filtration of the given image, (b) the corresponding persistence barcode, and (c) the corresponding persistence diagram. For the filtration value $i = 0$, only five pixels with zero intensity are activated, and they form five connected components. These five connected components correspond to the five solid lines in the persistence barcode. For $X_1$, the five connected components are integrated into one, so that four solid lines in the barcode end at $i = 1$. One connected component remains until $X_4$, as shown in the figure by the first orange line in the barcode and the point $(0, 4)$ in the diagram. For one-dimensional homology, one loop enclosed by the connected components is born at $X_1$ and dies at $X_4$, corresponding to the red dashed line in the barcode and the red dot in the diagram.
As shown above, the persistence diagram contains useful topological characteristics of the given data. However, it cannot be fed into neural networks directly due to its multiset nature. Vectorization is a way of feeding the topological features appearing in the persistence diagram into neural networks. Various vectorization methods have been proposed that convert the persistence diagram into a finite-size representation, such as BS, PL, and PI. Let $\epsilon_1, \epsilon_2, \cdots, \epsilon_d$ be a sequence of filtration values and $k \in \mathbb{N}$ be a fixed number. Briefly speaking, the kth BS [14] is defined by the sequence of kth Betti numbers $\left(\beta_k(X_{\epsilon_1}), \ldots, \beta_k(X_{\epsilon_d})\right)$ along the filtration. PI [12] is an image-like representation that applies the Gaussian kernel to each point in the persistence diagram.

Betti sequences
In this work, we utilized BS for the CNN-TDA Net as the vectorization of topological features. For the given filtered complex $X_\epsilon$, the BS is constructed by tracking the variation of the Betti numbers of each complex $X_\epsilon$ with the increasing value of $\epsilon$. In other words, the Betti curve (BC) is a function assigning to each value $\epsilon$ the Betti number $\beta_k(X_\epsilon)$.
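Definition 3 (Betti curve/Betti sequence). Let $\epsilon_1 \leqslant \epsilon_2 \leqslant \cdots \leqslant \epsilon_n$ be equally spaced points in the parameter space. For the persistence diagram $PD_k$ of the kth persistence module, the kth Betti curve is the function
$$\beta_k(\epsilon) = \#\{(b, d) \in PD_k : b \leqslant \epsilon < d\}.$$
The kth Betti sequence is the sequence of the numbers
$$\left(\beta_k(\epsilon_1), \beta_k(\epsilon_2), \ldots, \beta_k(\epsilon_n)\right).$$
The BC is easy to compute. One can observe that the BC is a linear combination of the indicator functions $\mathbf{1}_{[\epsilon_i, \epsilon_{i+1}]}$. Thus, the BC is a step function, and its space forms a vector space with the $L^p$-norm
$$\|\beta_k\|_p = \left( \int |\beta_k(\epsilon)|^p \, d\epsilon \right)^{1/p}.$$
It follows that the BC is suitable for statistical inference and machine learning [30].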
By stacking the kth BSs along the homology dimensions, we can represent them as a multichannel sequential data structure where each data point is a d-dimensional vector, i.e. $\left(\beta_0(\epsilon_i), \beta_1(\epsilon_i), \ldots, \beta_{d-1}(\epsilon_i)\right)$ for $i = 1, \ldots, n$. This interpretation of the BSs as sequential data motivated us to utilize a one-dimensional CNN for analyzing BSs.
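As a worked example, a Betti sequence can be computed directly from a list of birth-death pairs. The sketch below is illustrative and makes two assumptions that are not stated in the text: a feature is counted as alive on the half-open interval $b \leqslant \epsilon < d$, and the intervals are those read off the description of figure 2 (five components born at 0, four of which die at 1 and one at 4, plus one loop born at 1 that dies at 4). The function name `betti_sequence` is hypothetical.

```python
import numpy as np


def betti_sequence(intervals, grid):
    """Count the intervals (b, d) with b <= eps < d at each eps in grid."""
    intervals = np.asarray(intervals, dtype=float).reshape(-1, 2)
    grid = np.asarray(grid, dtype=float)
    births = intervals[:, 0][None, :]
    deaths = intervals[:, 1][None, :]
    alive = (births <= grid[:, None]) & (grid[:, None] < deaths)
    return alive.sum(axis=1)


# Intervals taken from the description of figure 2 (assumed half-open).
h0 = [(0, 1)] * 4 + [(0, 4)]   # zero-dimensional homology
h1 = [(1, 4)]                  # one-dimensional homology
grid = np.arange(0, 5)         # filtration values 0, 1, 2, 3, 4

# Stack the two sequences as channels: one row per filtration value.
bs = np.stack([betti_sequence(h0, grid), betti_sequence(h1, grid)], axis=1)
print(bs.shape)  # (5, 2)
print(bs[:, 0])  # [5 1 1 1 0]
print(bs[:, 1])  # [0 1 1 1 0]
```

The resulting array has one channel per homology dimension, which is the multichannel sequence format that a one-dimensional CNN consumes.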

TDA-Net
A CNN-TDA Net (or simply TDA-Net) [17] is a family of neural networks that take an image and the corresponding topological features as inputs to predict its label. TDA-Net consists of three subnetworks: an image network, a TDA network (or a topology network), and a prediction head. An image network $f_{img}$ is a CNN that takes an image as an input and extracts a fixed-size feature vector. A TDA network $f_{tda}$ is a parameterized nonlinear function that encodes TDA features corresponding to the input image into a feature vector of a fixed size. That is, for the given image $x_{img}$ and the corresponding TDA features $x_{tda}$, we have
$$h_{img} = f_{img}(x_{img}), \qquad h_{tda} = f_{tda}(x_{tda}),$$
where $h_{img}$ and $h_{tda}$ are the feature vectors with specific dimensions determined by the input size and the network architecture, respectively. Again, examples of topological features could be PI, PL, and BS. The structure of the TDA network depends on the shape of the vectorization. For example, one can use a two-dimensional CNN for PI because PI is given as a two-dimensional image, and a one-dimensional CNN for PL and BS because both PL and BS are basically one-dimensional sequences of numbers. In this paper, we focus on the effects of topological features using BS; the resulting network is simply referred to as a CNN-BS Net. The concatenation of $h_{img}$ and $h_{tda}$ is fed into the prediction head. A typical choice for a prediction head $f_{head}$ is an MLP, but any nonlinear function that produces a prediction can be used:
$$y = f_{head}\left([h_{img}; h_{tda}]\right),$$
where $[\,\cdot\,;\,\cdot\,]$ denotes concatenation and $y$ is a scalar for a one-dimensional regression task or a vector for a classification task. Figure 3 shows

Grad-CAM
Grad-CAM [25] is a technique for visually explaining the decision of a CNN-based model. It measures the influence of each pixel in the feature maps of the last convolutional layer on the model prediction. Let $y^c$ be the prediction of the model for a class $c$ and $A^k$ be the kth feature map of the last convolutional layer. Then, the class-discriminative localization map $L^c_{\text{Grad-CAM}}$, which has the form of a heat map, is computed as follows:
$$L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\left( \sum_k \alpha^c_k A^k \right), \tag{9}$$
where $\alpha^c_k$ is the importance of the kth feature map. $\alpha^c_k$ is defined as the average of the partial derivative values of $y^c$ with respect to each pixel in $A^k$:
$$\alpha^c_k = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A^k_{ij}}, \tag{10}$$
where $(i, j)$ is the coordinate of each pixel in the feature map $A^k$ and $Z$ is the number of pixels in $A^k$. In (9), $\mathrm{ReLU}(x) := \max(0, x)$ ensures that $L^c_{\text{Grad-CAM}}$ is greater than or equal to zero. The resolution of $L^c_{\text{Grad-CAM}}$ is the same as that of the feature maps, which is generally smaller than that of the input image. Thus, $L^c_{\text{Grad-CAM}}$ is upsampled by bilinear interpolation to $\tilde{L}^c_{\text{Grad-CAM}}$, which has the same resolution as the input image. The visual interpretation of the model is obtained by superimposing $\tilde{L}^c_{\text{Grad-CAM}}$ on the input image.
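The two Grad-CAM formulas can be checked on a small numerical example. The sketch below is a minimal NumPy illustration, assuming the feature maps and the gradients of $y^c$ with respect to them are already available as arrays of shape (K, H, W); in practice they would come from an automatic-differentiation framework. The function name `grad_cam` and the toy arrays are hypothetical, and the bilinear upsampling step is omitted.

```python
import numpy as np


def grad_cam(feature_maps, gradients):
    """Grad-CAM heat map from feature maps A^k and gradients dy^c/dA^k.

    Both inputs have shape (K, H, W). Returns the (H, W) heat map and the
    channel importances alpha of shape (K,).
    """
    A = np.asarray(feature_maps, dtype=float)
    G = np.asarray(gradients, dtype=float)
    alpha = G.mean(axis=(1, 2))                    # average gradient per channel
    weighted_sum = np.tensordot(alpha, A, axes=1)  # sum_k alpha_k * A^k
    return np.maximum(0.0, weighted_sum), alpha    # ReLU keeps positive evidence


# Toy example with K = 2 feature maps of size 2 x 2.
A = np.array([[[1.0, 1.0], [1.0, 1.0]],
              [[1.0, -3.0], [0.0, 4.0]]])
G = np.array([[[2.0, 2.0], [2.0, 2.0]],
              [[-1.0, -1.0], [-1.0, -1.0]]])
heat, alpha = grad_cam(A, G)
print(alpha)  # [ 2. -1.]
print(heat)   # [[1. 5.] [2. 0.]]
```

In the toy example the lower-right pixel has a weighted sum of $2 - 4 = -2$, which the ReLU clips to zero, so only regions with a positive influence on the class score remain in the heat map.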

Grad-CAM on multimodal networks
To the best of our knowledge, the current work is the first to apply Grad-CAM to CNN-TDA Nets and analyze the role of TDA features on a CNN. The application of Grad-CAM to a multimodal network was initially implemented in [31]. Here, we first provide an elaborated formulation. We perform Grad-CAM on the last feature maps $A$ of an image network and the last feature maps $B$ of a TDA network, respectively, i.e.
$$L^c_{image} = \mathrm{ReLU}\left( \sum_k \alpha^c_k A^k \right), \tag{11}$$
$$\alpha^c_k = \frac{1}{Z_{A^k}} \sum_i \sum_j \frac{\partial y^c}{\partial A^k_{ij}}, \tag{12}$$
where $A^k$ is the kth feature map of the image network and $Z_{A^k}$ is the total number of pixels in $A^k$, and
$$L^c_{tda} = \mathrm{ReLU}\left( \sum_n \alpha^c_n B^n \right), \tag{13}$$
$$\alpha^c_n = \frac{1}{Z_{B^n}} \sum_m \frac{\partial y^c}{\partial B^n_m}, \tag{14}$$
where $B^n$ is the nth feature map of the TDA network in the one-dimensional CNN and $Z_{B^n}$ is the total number of pixels in $B^n$. Then, $L^c_{image}$ and $L^c_{tda}$ are upsampled to $\tilde{L}^c_{image}$ and $\tilde{L}^c_{tda}$ so that they have the same resolution as the input image and the computed BS, respectively. Figure 5 illustrates the procedure of Grad-CAM analysis on the multimodal network. In figure 5, the top-left figure shows the original input image, which has the main image source at the center in a noisy background. The middle left figure shows the heat map obtained by Grad-CAM applied to the image network and displayed on the original image. Figure 5 shows how the heat maps for the image and topology networks are obtained in the CNN-TDA Net. As shown in the figure, the prediction head is a nonlinear function of the two feature maps from the image and topology networks. Thus, the heat map of the image network is affected by the topology network.
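For the topology branch, the same computation runs over one-dimensional feature maps, and the heat map is then stretched to the length of the BS. The sketch below is an illustration under stated assumptions: the feature maps and gradients are given as arrays of shape (N, M), linear interpolation via `np.interp` stands in for the upsampling step, and the function name `grad_cam_1d` and the toy values are hypothetical.

```python
import numpy as np


def grad_cam_1d(feature_maps, gradients, out_len):
    """1-D Grad-CAM over feature maps B^n of shape (N, M), upsampled to out_len."""
    B = np.asarray(feature_maps, dtype=float)
    G = np.asarray(gradients, dtype=float)
    alpha = G.mean(axis=1)                 # importance of each 1-D feature map
    heat = np.maximum(0.0, alpha @ B)      # ReLU(sum_n alpha_n * B^n), shape (M,)
    # Linear interpolation onto out_len equally spaced positions.
    src = np.linspace(0.0, 1.0, num=heat.size)
    dst = np.linspace(0.0, 1.0, num=out_len)
    return np.interp(dst, src, heat)


# Toy example: N = 2 feature maps of length M = 3, upsampled to length 5.
B = np.array([[0.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])
G = np.array([[1.0, 1.0, 1.0],
              [-0.5, -0.5, -0.5]])
heat_tda = grad_cam_1d(B, G, out_len=5)
t_star = int(np.argmax(heat_tda))
print(heat_tda)  # [0.   0.25 0.5  0.25 0.  ]
print(t_star)    # 2
```

The index of the largest value marks the position along the BS where the topology branch contributes most to the prediction.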
Here we note that there are no cross-effects of both feature maps in the MLP after the concatenation of the two flattened vectors of $A$ and $B$. With the ReLU activation function, it is guaranteed that there is no nonlinear term between any $A^k_{ij}$ and $B^n_m$ when $y^c$ is a predicted logit (before applying the Softmax function). Hence, the partial derivative with respect to $A^k_{ij}$ or $B^n_m$ is either zero or the linear weight corresponding to each of them.
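The piecewise-linear behaviour described above can be illustrated numerically. The sketch below is a hypothetical one-hidden-layer ReLU head with random weights, not the network used in the paper; all sizes and names are assumptions. Within a fixed activation pattern, the gradient of the logit with respect to the concatenated features is a fixed combination of weights, and a finite-difference check agrees with it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical flattened features from the two branches.
a = rng.normal(size=4)            # image-network features
b = rng.normal(size=3)            # TDA-network features
x = np.concatenate([a, b])

# Hypothetical one-hidden-layer ReLU head producing a single logit.
W1 = rng.normal(size=(5, 7))
b1 = rng.normal(size=5)
w2 = rng.normal(size=5)
b2 = 0.1


def logit(v):
    return w2 @ np.maximum(0.0, W1 @ v + b1) + b2


# Analytic gradient: only active hidden units contribute their weights.
mask = (W1 @ x + b1 > 0).astype(float)
grad = W1.T @ (w2 * mask)
grad_a, grad_b = grad[:4], grad[4:]   # parts belonging to A and to B

# Central finite differences for comparison.
eps = 1e-6
numeric = np.array([(logit(x + eps * e) - logit(x - eps * e)) / (2 * eps)
                    for e in np.eye(x.size)])
print(np.allclose(numeric, grad, atol=1e-5))
```

The gradient contains no product of an image feature with a TDA feature; the two branches interact only through which hidden units are active.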

Topological inverse image analysis
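In order to interpret the Grad-CAM results of a TDA network, we define the topological inverse image of a range of filtration values. Let $I \in \mathbb{R}^{h \times w}$ be a grayscale image of height $h$ and width $w$, and let $t_a \leqslant t_b$ be two filtration values. We define the sublevel image $I_{(-\infty, t_a)} \in \mathbb{R}^{h \times w}$ of $I$ at $t_a$ as follows:
$$I_{(-\infty, t_a)}[i, j] = \begin{cases} I[i, j], & \text{if } I[i, j] < t_a, \\ 0, & \text{otherwise}, \end{cases} \tag{15}$$
for $i = 1, 2, \cdots, h$, and $j = 1, 2, \cdots, w$. Then, the topological inverse image of $I$ of the range $[t_a, t_b)$ is defined as the subtraction of $I_{(-\infty, t_a)}$ from $I_{(-\infty, t_b)}$, i.e.
$$I_{[t_a, t_b)} = I_{(-\infty, t_b)} - I_{(-\infty, t_a)}. \tag{16}$$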
That is, $I_{[t_a, t_b)}$ is the image $I$ in which every pixel whose intensity lies outside the range $[t_a, t_b)$ is set to zero. A visual explanation of the inverse image is shown in figure 6. Panel (a) of figure 6 is the original grayscale image of a point source with a noisy background. As shown in the figure, the main object of interest is located in the image center. As the noise intensity in the background is non-negligible compared to the intensity of the main object in the center, a CNN could be biased toward such high-frequency noise in the background. Panel (e) shows the topological inverse image for the range $[t_a, t_b)$. As shown in (e), the main image source stands out more than in (c) and (d). As shown later, the heat map by Grad-CAM indicates the main image characteristics well through the topological inverse image map.
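The topological inverse image is straightforward to compute once two filtration values are given. The following minimal sketch follows the description above (keep the pixels with intensity in $[t_a, t_b)$ and zero out the rest); the function names and the small test image are hypothetical.

```python
import numpy as np


def sublevel_image(image, t):
    """Keep the pixels with intensity strictly below t; set the rest to zero."""
    image = np.asarray(image)
    return np.where(image < t, image, 0)


def topological_inverse_image(image, t_a, t_b):
    """Image restricted to intensities in [t_a, t_b), for t_a <= t_b."""
    return sublevel_image(image, t_b) - sublevel_image(image, t_a)


# Hypothetical 2 x 2 grayscale image.
img = np.array([[10, 50],
                [120, 200]])
inv = topological_inverse_image(img, t_a=40, t_b=150)
print(inv)  # [[  0  50] [120   0]]
```

Pixels below $t_a$ appear in both sublevel images and cancel in the subtraction, while pixels at or above $t_b$ appear in neither, so only the band $[t_a, t_b)$ survives.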
We investigate the topological inverse image of the range near the filtration value that has the maximum in Grad-CAM, that is, the filtration value where the heat map by Grad-CAM has the warmest color. Here, the range is determined by the receptive field of the TDA network. We recall that the receptive field of a pixel in feature maps is the region in the input involved in the calculation of that pixel. Hence, we consider the topological inverse image $I_{[t_a, t_b)}$ whose endpoints $t_a$ and $t_b$ (equations (17) and (18)) are determined by $t^*$, $l_{rf}$, and $l_{bc}$, where $t^* = \arg\max L^c_{tda}$, $l_{rf}$ is the length of the receptive field of the TDA network, and $l_{bc}$ is the length of the BS. Therefore, the topological inverse image can be interpreted as a transformation from a feature map obtained from the BS and Grad-CAM into the region of the input image. While the inverse of a given BS may not be unique in general, the definition of our topological inverse image ensures that it can always be obtained when an image and two filtration values are given. As shown in the following section, Grad-CAM analysis shows that the topological inverse image identified by the heat map is well-matched with the main image of interest. This implies that the topology network is less sensitive to background noise than the image network.
The reproducible code for the experiments with the HAM10000 dataset can be found at https://github.com/HiddenBeginner/cnntdanet_gradcam.

Experiments
In this section, we provide the results of the Grad-CAM analysis of CNN-TDA Net. Among the possible vectorization methods for persistence diagrams, we mainly focus on the BS explained in section 2. Specifically, for the following experiments, we used Giotto-TDA [32], a topological machine learning toolbox in Python, to generate topological features, and TensorFlow [33] to build and train the combined neural networks.

Dataset
We consider the grayscale image classification problem; the cubical complex explained in section 2 can be directly applied to grayscale images. We conducted experiments using those two famous image classification datasets, i.e. the transient versus bogus image dataset and the HAM10000 dataset.

Transient-vs-bogus dataset
One of the most important challenges in astrophysics today is to detect transients in given images. This problem is particularly important for the identification of the electromagnetic (EM) counterpart of a detected gravitational wave [34]. More precisely, transients are astrophysical objects whose brightness appears for just milliseconds to days in a time sequence of astrophysical images. Transient candidates include kilonovae, supernovae, gamma-ray explosions, etc. Among the transient candidates, a kilonova is an event caused by the emission of gravitational waves from a compact binary system. Thus, identifying the kilonova as a transient is a key tool for identifying the EM counterpart of the detected gravitational wave. Meanwhile, bogus events appear briefly in the images like transients but are not actual transients of interest. The desired classification model is to detect the true positives (transients) and identify the false positives (bogus). In this study, we used the transient-vs-bogus dataset generated and provided by IMSNG at Seoul National University. It contains a set of 38 × 38 grayscale images composed of 307 transient images and 3424 bogus images. Note that this dataset is imbalanced.

Table 1. Architectures of the image network based on O'TRAIN [36] (top) and the topology network based on WaveNet [37] (bottom) before the classification head. (i: the layer number, k: the filter size, nc_i: the number of filters, p: the dropout rate, r_dilated: the rate of dilation).

Image network (top)
i     Operator                 nc_i
...
3     AvgPooling2D (k = 2)     32
4     Conv2D (k = 3)           64
5     MaxPooling2D (k = 2)     64
6     Dropout (p = 0.3)        64
7     Conv2D (k = 3)           128
8     MaxPooling2D (k = 2)     128
9     Dropout (p = 0.3)        128
10    Conv2D (k = 3)           256
11    MaxPooling2D (k = 2)     256
12    Flatten                  -

Topology network (bottom)
i     Operator                 nc_i     r_dilated
...
3     Conv1D (k = 2)           20       4
4     Conv1D (k = 2)           20       8
5     Conv1D (k = 2)           20       1
6     Conv1D (k = 2)           20       2
7     Conv1D (k = 2)           20       4
8     Conv1D (k = 2)           20       8
9     Conv1D (k = 1)           10       -
10    Flatten                  -

The HAM10000 dataset
The HAM10000 [35] dataset consists of 10 015 dermatoscopic images collected from different skin populations. The dataset provides 28 × 28 grayscale and RGB images, as well as raw RGB dermatoscopic images. Among them, we used 28 × 28 grayscale images for our experiments. Each image is labeled with one of the seven pigmented skin lesions. The proportion of each lesion varies from approximately 67% for the major class to 1.1% for the minor class, indicating a typical class imbalance.
For both datasets, we generated the BS for each image. We used the filtered cubical complexes and computed both the zero- and one-dimensional persistence diagrams. Then, we obtained the BS with n_bins = 100, i.e. the range of filtration values is divided into 100 intervals so that the resulting BS has the form of a 100 × 2 matrix. We used 90% of the data for training and the rest for validation.

Network architecture and training configuration
We conducted Grad-CAM analysis on the CNN-BS Net. The architecture of the image network follows that of O'TRAIN [36]. O'TRAIN performs well on various grayscale images, particularly on the transient versus bogus image classification problem, and is widely used in the astrophysical image analysis community. For the topology network, we selected WaveNet [37], which is suitable for sequential data because the BS can also be regarded as sequential data, as explained in section 2.2. Table 1 shows the optimized image network of the O'TRAIN architecture (top) and the topology network adopted from the architecture of WaveNet (bottom). Here, we note that all convolutional operations are applied after zero padding. The concatenation of the two flattened feature maps is fed into the classification head. The classification head is an MLP with two hidden layers with 512 and 256 neurons. More training details and learning curves are provided in appendix A.
Figures 7 and 8 show the Grad-CAM analysis of the transient versus bogus image datasets. Figure 7 shows the results of the multimodal Grad-CAM analysis of an image that contains a transient in the image center. In the figure, (a) is the image that contains a single transient. As shown in the figure, the transient, a round and bright object, is located at the image center. The color images in the top row in columns (b) through (g) are the Grad-CAM heat maps over the single CNN (i.e. without the topology network). The heat maps in the bottom row are the Grad-CAM heat maps over the CNN-BS Net, that is, the heat maps when the topology network is combined with the original CNN. The images in column (b) are the aggregated Grad-CAM heat maps displayed over the input image (a), without (top) and with (bottom) the topology network, respectively. The images in column (b) show slight differences between the networks with and without the topology network.
The bottom image in column (b) shows that the heat map is concentrated around the transient and is more symmetric than that of the single CNN shown at the top. Note that the images in column (b) aggregate all the individual feature maps obtained by the network, as indicated in (9) and (10). Thus, it is worth investigating how the heat map of each individual feature map changes when the image network is combined with the topology network. The images in columns (c) through (g) are the heat maps of the top five feature maps, i.e. those with the highest channel importance values defined in (10). As shown in the figure, the heat map of each feature map with the topology network is more centered around the transient than the heat map of the single CNN, except for the fifth feature map in column (g). Although there are differences between the single CNN and the CNN-BS Net, the aggregated heat maps in column (b) indicate that both networks capture the desired transient features reasonably well.
The image in column (h) is the topological inverse image $I_{[t_a, t_b)}$, where $t_a$ and $t_b$ are defined in (17) and (18) for the length of the receptive field $l_{rf} = 32$, the dimension of the BSs $l_{bc} = 100$, and the filtration value $t^*$ at which the Grad-CAM on the BS, shown in column (i), attains its maximum. The blue line in (i) is the zero-dimensional BS and the red line is the one-dimensional BS. The Grad-CAM heat map is displayed along the x-axis. The feature image in (h) is obtained from the heat map in (i) using the topological inverse image map. As shown in the figure, the topological inverse image captures the overall shape of the transient in the center well. As the following results also show, the CNN-BS Net explicitly captures the shape of the desired object through the topological inverse image map, (15) and (16).
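The relation between the aggregated heat map in column (b) and the per-channel heat maps in columns (c) through (g) can be sketched as follows. The sketch assumes that (9) and (10) take the standard Grad-CAM form, in which the channel importance is the spatial average of the gradient of the class score with respect to each feature map, and the heat map is the rectified importance-weighted sum of the feature maps. The function name `grad_cam` and the random arrays standing in for feature maps and gradients are illustrative assumptions.

```python
import numpy as np

def grad_cam(feature_maps, gradients, top_k=5):
    """feature_maps, gradients: arrays of shape (K, H, W).

    gradients holds d(class score)/d(feature map), assumed precomputed.
    """
    # Channel importance: spatial average of the gradients (assumed form of (10)).
    alpha = gradients.mean(axis=(1, 2))
    # Aggregated heat map: ReLU of the importance-weighted sum over channels.
    aggregated = np.maximum(np.tensordot(alpha, feature_maps, axes=1), 0.0)
    # Indices of the top_k channels, ordered by decreasing importance.
    top = np.argsort(alpha)[::-1][:top_k]
    # Heat map of each selected channel taken on its own.
    per_channel = np.maximum(alpha[top][:, None, None] * feature_maps[top], 0.0)
    return aggregated, per_channel, top, alpha

# Random stand-ins for the feature maps and gradients (hypothetical data).
rng = np.random.default_rng(0)
A = rng.random((8, 6, 6))
G = rng.standard_normal((8, 6, 6))
aggregated, per_channel, top, alpha = grad_cam(A, G)
print(aggregated.shape, per_channel.shape, top)
```

In practice the heat maps are upsampled to the input resolution before being overlaid on the image; that step is omitted here.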

Transient-vs-bogus dataset
The difference between the single CNN and the CNN-BS Net becomes clearer for the bogus images shown in figure 8. Each column has the same meaning as in figure 7. In figure 8, we consider three different bogus images, as shown in column (a). The bogus images differ from the transient images: they are not necessarily round or symmetric, owing to the nature of the bogus objects in the image. None of them looks similar to the transient shown in (a) of figure 7. We observe extreme differences between the single CNN and the CNN-BS Net in the results for the bogus images. The aggregated heat maps in column (b) differ significantly between the single CNN and the CNN-BS Net. In columns (c) through (g), the top five individual feature maps are displayed. From those feature maps, we observe that the single CNN has multiple redundant and zero feature maps, even though each of those feature maps has a high channel importance value. Notably, the feature maps in columns (c),
Figure 9 shows the results for the HAM10000 dataset. The images in column (a) show two different skin lesions. The images in columns (b) through (i) have the same meaning as in the previous example. We used the second-to-last feature maps of the single CNN and the image network to compute the importance of the kth feature map, $\alpha_k^c$, in (10) and (12), because the last feature maps have a resolution of only 3 × 3, which is too small. For this example, note that black artifacts exist around the skin lesion in the image. The top image shows sharp black artifacts near the four corners of the image. The bottom image in column (a) has broad black artifacts near the left and right boundaries. These black artifacts in both images are not the skin lesions of interest but meaningless background or device artifacts. 
The problem is that these artifacts have a structured shape and could be incorrectly interpreted in the feature map as the main source of the skin lesion for classification. In fact, it is well known that CNNs can be easily biased towards high-frequency information or task-irrelevant artifacts in the input images [8,9]. As observed in the feature map images in columns (b) through (g) in the top rows (the single CNN without the topology network), the artifacts are highlighted in the heat map more than the region where the actual skin lesion of interest exists. In other words, the CNN-based model builds its classification by focusing more on such artifacts. It is interesting that no feature map of the single CNN highlights the skin lesion in its heat map. In contrast, the heat maps of the feature maps of the CNN-BS Net in the bottom rows are markedly different. Regardless of the class it predicts, the CNN-BS Net yields heat maps that highlight the skin lesions and do not exploit the artifacts, except for the heat map in column (c) for the top image of (a). Similarly, for the bottom image of (a), the CNN-BS Net highlights the skin lesions rather than the artifacts. Only the heat maps in columns (f) and (g) highlight the main lesion and the artifacts at a similar level. The images in column (h) show that the black artifacts were also captured by the topology network, allowing the image network to focus more on the task-relevant areas: the feature map of the image network in column (b), when combined with the topology network, focuses on the skin lesion in the center. This also shows that the warmest region of the one-dimensional Grad-CAM heat map of the BS in column (i) corresponds to filtration values whose topological inverse image matches the skin lesion well, steering the image network towards meaningful feature maps. The results of other vectorization methods can be found in figures B1 and B2.
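The topological inverse image shown in column (h) can be illustrated with a minimal sketch. Because the definitions (15) through (18) are not restated in this section, the sketch makes two loudly stated assumptions: the inverse image keeps the pixels whose intensity lies in $[t_a, t_b)$, and the interval is a window of $l_{rf}$ bins centered on the bin at which the one-dimensional Grad-CAM heat map is largest. The function names `inverse_image` and `window_from_heatmap`, the toy image, and the filtration range [0, 1) are hypothetical.

```python
import numpy as np

def inverse_image(image, t_a, t_b):
    """Keep pixels with t_a <= value < t_b and set the others to zero.

    Assumed reading of the topological inverse image for a filtration
    given by pixel intensity.
    """
    mask = (image >= t_a) & (image < t_b)
    return np.where(mask, image, 0.0), mask

def window_from_heatmap(heat_1d, t_min, t_max, l_rf=32):
    """Hypothetical choice of [t_a, t_b): l_rf bins around the hottest bin."""
    l_bc = len(heat_1d)                      # number of bins of the BS
    i_star = int(np.argmax(heat_1d))         # bin containing t*
    lo = max(0, i_star - l_rf // 2)          # clip the window at the left end
    hi = min(l_bc, lo + l_rf)                # clip the window at the right end
    step = (t_max - t_min) / l_bc
    return t_min + lo * step, t_min + hi * step

# Toy 1D heat map peaking at bin 50 and a 2 x 2 toy image (hypothetical data).
heat = np.zeros(100)
heat[50] = 1.0
t_a, t_b = window_from_heatmap(heat, t_min=0.0, t_max=1.0, l_rf=32)
image = np.array([[0.1, 0.4], [0.6, 0.9]])
restricted, mask = inverse_image(image, t_a, t_b)
print(t_a, t_b)
print(mask)
```

Under these assumptions, only the pixels whose intensities fall inside the selected filtration window survive, which is how a peak in the heat map on the BS is traced back to a region of the input image.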

Comparison of the localization ability of Grad-CAM
To verify our conjecture that the topology network enforces the image network to focus on the regions of interest, we calculated the localization abilities of the heat maps of both the single CNN and the CNN-BS Net. We randomly sampled 100 images and manually labeled the segmentation masks for the regions of skin lesions. In figure 10, column (b) shows the segmentation masks corresponding to the images in column (a). The CNN and the CNN-BS Net were trained on the remaining images, excluding the 100 samples. After training, we binarized the Grad-CAM heat maps of the two networks with a threshold, as illustrated in columns (c) and (d) of figure 10, which show the binarized Grad-CAM heat maps of the single CNN and the CNN-BS Net for a threshold value of 0.2. We then calculated the IoU score between the segmentation mask and the binarized heat map. We trained the two networks 30 times with different random seeds, binarized the heat maps with threshold values of 0.01, 0.1, 0.2, 0.3, 0.4, and 0.5, and computed the average IoU score for each threshold. Figure 11 shows the average IoU scores between the segmentation masks and the binarized heat maps of the single CNN (blue solid line) and the CNN-BS Net (red solid line). As shown in the figure, the CNN-BS Net, which has higher IoU scores than the single CNN, captures the regions of interest more accurately than the single CNN. For the single CNN, the average IoU score decreases linearly as the threshold increases. However, it is interesting to observe that, for the CNN-BS Net, the average IoU score changes nonlinearly. That is, the score increases with the threshold, reaches a local maximum near the threshold value of 0.2, and then decreases linearly. 
The fact that the IoU score increases as the threshold increases indicates that the CNN-BS Net may also attend to the artifacts, but only weakly, so that such attention disappears quickly as the threshold value increases. This implies that the image network of the CNN-BS Net focuses on the regions of interest more efficiently than the single CNN.
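The IoU evaluation described above can be sketched as follows. The sketch assumes that each heat map is normalized to [0, 1] and that a pixel is kept when its value is at least the threshold; whether the comparison is strict is an assumption, as are the function name `iou` and the toy mask and heat map.

```python
import numpy as np

def iou(mask, heat, threshold):
    """IoU between a binary segmentation mask and a thresholded heat map."""
    pred = heat >= threshold                 # binarized heat map
    mask = mask.astype(bool)
    union = np.logical_or(pred, mask).sum()
    if union == 0:                           # both empty: define the score as 0
        return 0.0
    return float(np.logical_and(pred, mask).sum() / union)

# Toy example (hypothetical data): a 2 x 2 lesion in the middle of a 4 x 4 image.
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
# Strong attention on the lesion and weak attention everywhere else.
heat = np.full((4, 4), 0.05)
heat[1:3, 1:3] = 0.9

thresholds = [0.01, 0.1, 0.2, 0.3, 0.4, 0.5]
scores = {t: iou(mask, heat, t) for t in thresholds}
print(scores)
```

In this toy case the weak attention outside the lesion passes the lowest threshold and enlarges the union, so the score rises once the threshold exceeds that weak level. This is the same mechanism proposed above for the initial increase of the CNN-BS Net curve.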

Comparison with a wide CNN
We investigate whether the enhanced classification performance and localization ability of the CNN-BS Net could simply result from the increased number of trainable parameters. To this end, we consider a wide CNN that has a number of parameters comparable to that of the CNN-BS Net. Specifically, we doubled the number of channels of the first four convolutional layers of the 2D-CNN presented in table 1. The number of parameters of each network can be found in table A2. As shown in figure 12, increasing the number of parameters did not enable the single CNN to focus on the region of interest, supporting our hypothesis that topological features enforce the image network of a CNN to focus on meaningful regions.
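The effect of doubling the channel counts on the number of trainable parameters can be illustrated with a short calculation. The channel widths and the 3 × 3 kernel size below are hypothetical placeholders, not the values in table 1; the sketch only uses the standard parameter count of a 2D convolutional layer with bias, kernel × kernel × input channels × output channels + output channels.

```python
def conv2d_params(c_in, c_out, kernel=3):
    """Weights plus biases of one 2D convolutional layer."""
    return kernel * kernel * c_in * c_out + c_out

def total_params(channels, c_in=1, kernel=3):
    """Sum the parameters of a stack of convolutional layers."""
    total = 0
    for c_out in channels:
        total += conv2d_params(c_in, c_out, kernel)
        c_in = c_out                      # output feeds the next layer
    return total

# Hypothetical channel widths for four convolutional layers on grayscale input.
base = [16, 32, 64, 128]
wide = [2 * c for c in base]              # doubled channels, as in the wide CNN

print(total_params(base), total_params(wide))
```

Doubling both the input and output channels of an interior layer roughly quadruples its weights, which is why widening a few layers is enough to reach a parameter count comparable to that of a two-branch network.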

Conclusion
Recently, it has been reported that the performance of a single CNN can be significantly improved by combining it with a topology network based on persistent homology. By nature, the concatenation of a CNN and a topology network constitutes a multimodal network. Despite its success in various applications, there is a lack of explanation as to how and why such topological signatures, when combined with a CNN, improve the discriminative power of the original CNN.
In this work, we visually demonstrated the effects of topological features on CNNs using numerical experiments with two famous examples: the transient versus bogus image dataset and the HAM10000 dataset. We proposed a methodology to apply Grad-CAM analysis to multimodal networks, particularly to the combined CNN-TDA network. We used the BS as the topological feature, so Grad-CAM was applied to the combined CNN-BS network. Furthermore, to interpret the effect of the TDA network on the image network of the CNN, we defined the topological inverse image map over a specific range bounded by two filtration values. The experimental results show that the single CNN produces multiple meaningless or redundant feature maps that are biased towards high-frequency artifacts or that have all zero values. Meanwhile, the topology network properly captures the shape of the object of interest in the input image, and the corresponding image network focuses more on the critical regions throughout the images than the single CNN. This was verified with the topological inverse image map through the heat map obtained by Grad-CAM of the topology network.
We hypothesized that the topology network characterizes the structure of an object and helps the image network to focus on specific regions of interest more efficiently than the single CNN. We verified this hypothesis qualitatively and quantitatively with experiments, which clarify how and why the topology network improves the performance of the original CNN. From the qualitative results obtained via visualization, we showed that the topological features can help the original CNN to focus on task-relevant features of the given image rather than on task-irrelevant artifacts or noise. Moreover, our quantitative analysis based on IoU scores revealed that the topological features enforce the image network to focus on the regions of interest more accurately and efficiently.
In the current work, our experiments were limited to domain-specific grayscale images. It is not known whether the CNN-TDA Net will behave similarly on more general images. Furthermore, a detailed mathematical analysis was not provided in this work. Our future work will extend the CNN-TDA Net to RGB image datasets such as CIFAR-10 or ImageNet with Grad-CAM analysis and the topological inverse image map. We will also provide a mathematical analysis of the effects of topological features on CNNs.

Data availability statement
The data cannot be made publicly available upon publication because no suitable repository exists for hosting data in this field of study. The data that support the findings of this study are available upon reasonable request from the authors.
Table A1. The result of the evaluation on the HAM10000 dataset. For the multi-class setting, we calculate metrics for each label and take the average weighted by the number of true data in each class. The mean (%) and standard deviation (%) of each metric over five-fold cross-validation are reported. Maximum value for each metric is bolded.