
Topological structure and global features enhanced graph reasoning model for non-small cell lung cancer segmentation from CT

Tiangang Zhang, Kai Wang, Hui Cui, Qiangguo Jin, Peng Cheng, Toshiya Nakaguchi, Changyang Li, Zhiyu Ning, Linlin Wang and Ping Xuan

Published 5 January 2023 © 2023 Institute of Physics and Engineering in Medicine
Citation: Tiangang Zhang et al 2023 Phys. Med. Biol. 68 025007 DOI 10.1088/1361-6560/acabff


Abstract

Objective. Accurate and automated segmentation of lung tumors from computed tomography (CT) images is critical yet challenging. Lung tumors vary in size and location and have indistinct boundaries where they adjoin other normal tissues. Approach. We propose a new segmentation model that integrates the topological structure and global features of image region nodes to address these challenges. Firstly, we construct a weighted graph whose nodes are image regions; the graph topology reflects the complex spatial relationships among these nodes, and each node has its specific attributes. Secondly, we propose a node-wise topological feature learning module based on a new graph convolutional autoencoder (GCA). Meanwhile, a node information supplementation (GNIS) module is established by integrating the specific features of each node, extracted by a convolutional neural network (CNN), into each encoding layer of the GCA. Afterwards, we construct a global feature extraction module based on a multi-layer perceptron (MLP) to encode the features learnt from all the image region nodes, which are crucial complementary information for tumor segmentation. Main results. Ablation study results on the public lung tumor segmentation dataset demonstrate the contributions of our major technical innovations. Compared with other segmentation methods, our new model improves the segmentation performance and generalizes across different 3D image segmentation backbones. Our model achieved a Dice of 0.7827, IoU of 0.6981, and HD of 32.1743 mm on the public dataset of the 2018 Medical Segmentation Decathlon challenge, and a Dice of 0.7004, IoU of 0.5704, and HD of 64.4661 mm on the lung tumor dataset from Shandong Cancer Hospital. Significance. The novel model improves automated lung tumor segmentation, especially for challenging and complex cases, by using the topological structure and global features of image region nodes. It has great potential for application to other CT segmentation tasks.


1. Introduction

Lung cancer is one of the most common cancers worldwide and the leading cause of cancer death, with 2.2 million new cases and 1.8 million deaths in 2020 (Sung et al 2021). Computed tomography (CT) is the most commonly used imaging modality for diagnosis, radiotherapy, and prognosis (Chen et al 2021). The tumor volume is usually delineated manually by experts on CT scans, which is a time-consuming and laborious task. Therefore, advanced image processing and learning methods are particularly important for automatic tumor segmentation from images. However, accurately segmenting tumors from CT scans is extremely challenging. The size, shape, and location of lung tumors vary across patients, and tumors may show strong texture similarity with other lung tissues. Because tumors are often closely attached to the lung, both local and global CT volume information needs to be considered to improve segmentation accuracy.

1.1. Early-stage deep learning models for medical image segmentation

Convolutional neural networks (CNNs), a class of deep learning models, have been widely applied to medical image segmentation (Pelt and Sethian 2018, Jin et al 2021), including liver tumor (Jin et al 2020) and lung tumor (Kasinathan et al 2019) segmentation. One of the early methods proposed for image segmentation was a patch-based training model that classified center pixels (Ciresan et al 2012). Its disadvantage was that the feature representation was limited to each patch and ignored global features (Chen et al 2019). To remove this limitation, Long et al (2015) proposed the fully convolutional network (FCN) to classify images at the pixel level. An early FCN-based method was U-Net (Ronneberger et al 2015), which was composed of an encoder, a decoder, and skip connections that fuse multi-scale features. nnU-Net (Isensee et al 2021), a deep learning-based segmentation method, automatically optimized the network structure, inference strategy, and other components, and achieved outstanding performance in many tasks. However, these methods did not consider the spatial relationships and semantic connections among distant image regions.

1.2. Lung tumor segmentation methods

The automatic segmentation of lung tumors from CT volumes (Manoharan et al 2020, Liu et al 2021) is a challenging task because tumors from different patients have various sizes, blurry borders, and irregular shapes. Kim et al (2019) proposed a CNN-based method that supplements detailed information into each layer of the encoder and decoder. Isensee et al (2021) developed the nnU-Net model based on the three-dimensional (3D) U-Net (Çiçek et al 2016), which achieved outstanding performance on several datasets, including the lung tumor dataset of the 2018 medical image segmentation competition. He et al (2016) proposed a residual learning framework, which provides stronger feature representation capabilities in deeper network structures. An attention gate-based model was proposed and integrated into the U-Net framework (Oktay et al 2018); it passes the information obtained by each layer of the encoder through an attention gate to filter more representative features. The global information among image region nodes is also important auxiliary information for tumor segmentation. However, these methods ignored the spatial connections and global information among image region nodes.

1.3. Graph-based long-distance relational reasoning

In recent years, graph reasoning models (Chen et al 2019) have become popular. A graph structure contains the topology of long-distance image region nodes and their attributes. Graph convolutional networks (GCNs) (Meng et al 2020) can deeply integrate the topology of the graph with the attributes of its nodes. GCNs are a form of graph reasoning and have been widely used in graph classification (Ying et al 2018), link prediction (Xuan et al 2020), node-wise classification (Li et al 2020), and other tasks. Because information can be propagated along the edges of the graph and across different nodes, image recognition tasks can also benefit from graph reasoning (Mo et al 2020). A common approach is to project image features into a low-dimensional feature space and then establish a fully connected graph for relational reasoning and node information dissemination (Chen et al 2019); the low-dimensional features obtained by reasoning are finally projected back to the original space. However, this operation results in a loss of the spatial relations among image region nodes. Li et al (2020) performed graph reasoning directly in the original coordinate space to eliminate the projection process. However, this approach reduces the dimensions of the convolutional features, causing the loss of detailed features of image region nodes. Moreover, after many rounds of mutual propagation of node information in the graph, the feature vectors of the nodes barely change and become nearly indistinguishable, which is the so-called over-smoothing problem (Bo et al 2020). Previous GCN-based methods ignored this problem.

To address these issues, we propose a graph convolutional autoencoder with node information supplementation (GNIS) for relational reasoning and apply it to lung tumor segmentation. The contributions of our proposed model are as follows.

  • (1)  
    We proposed a graph construction strategy to associate image regions with graph nodes. The graph edges embed the spatial correlations among these region nodes.
  • (2)  
    To facilitate graph reasoning, we developed a node-wise topological feature learning module based on a graph convolutional autoencoder (GCA). Compared with a conventional GCA, our topology learning module has a node information supplementation (GNIS) module. During topological feature learning, the graph topology is initially formed by the image region nodes and edges, and GNIS adaptively learns the specific features of each region node during training. Compared with existing graph topology extraction approaches, GNIS can embed the specific features of an image region node into each graph convolutional encoding layer as supplementary information to enhance graph reasoning. In this way, GNIS preserves the detailed features of region nodes and decreases the risk of over-smoothing caused by multi-layer graph convolutional encoding.
  • (3)  
    Global features aggregated from all the image region nodes are crucial complementary information for accurate lung tumor segmentation. Thus, we proposed a new multi-layer perceptron (MLP) based module to learn the global features of image region nodes that were neglected in existing graph-based reasoning models. Comparison with other methods and ablation study results over two datasets demonstrated the improved performance and contributions of global features by MLP based module and topological structure by GNIS. In addition, we integrated our model with different segmentation backbones. The improved segmentation results showed that our model can be generalized to various image segmentation networks.

2. Method

The proposed segmentation model, TGSeg, which integrates the topology and global features of image region nodes, is shown in figure 1. Given the 3D CT volume of a lung tumor patient, nnU-Net is used as the segmentation backbone, and image features, including texture and shape features, are extracted by its encoder. From the output features of the last encoding layer, modules based on GCN and CNN are built. The topological representation of all nodes is obtained by learning the connections among the image region nodes. In particular, a CNN-based convolution module is built to learn the specific features of each node and alleviate the over-smoothing problem in the graph convolution process; these features are supplemented into the corresponding graph convolution layers. In parallel, a global feature representation of all nodes is formed by learning the low-dimensional features of image region nodes and aggregating their representative global features. Finally, the topological and global feature representations are deeply fused for the final segmentation.


Figure 1. Framework of the proposed TGSeg model. (a) The 3D CT volume of a patient is fed to the segmentation backbone encoder, and (b) low-dimensional features are learnt from the CT volume. We then (c) encode the graph topology features and the specific features of image nodes, and (d) learn the global features of image nodes. Finally, (e) multiple types of features are integrated to (f) obtain the segmentation result. The detailed structure of (c) is given in figure 2.


2.1. Learning low-dimensional feature representations of image region nodes

To extract low-dimensional feature representations of 3D CT volumes, the 3D nnU-Net is used as the segmentation backbone. The 3D encoder consists of six encoding layers, each of which includes a 3D convolution (conv3D) and a strided convolution. The strided convolution replaces pooling for downsampling, which effectively avoids the loss of detailed information. Instance normalization and the LeakyReLU activation function are applied after each convolution operation. The decoder consists of six decoding layers. Each decoding layer uses a transposed convolution and a conv3D to perform upsampling; the conv3D settings in the decoding layers are the same as those in the encoding layers. Skip connections are established from the encoding layers so that each decoding layer can integrate the detailed features of the corresponding encoding layer. Let ${\bf{F}}\in {{\mathbb{R}}}^{H\times W\times D\times C}$ denote the output feature map of the last encoding layer, where C is the number of channels and H, W, D are the height, width, and depth, respectively. F is considered as the low-dimensional feature representation of the image region nodes of the input CT volume.
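As a concrete illustration of the encoder design described above, the following PyTorch sketch shows one encoding layer (conv3D, instance normalization, LeakyReLU, and a strided convolution for downsampling). It is a minimal sketch rather than the exact nnU-Net configuration; the class name EncodingLayer3D and the channel sizes are illustrative.

```python
# Minimal sketch of one 3D encoding layer: conv3D -> InstanceNorm -> LeakyReLU,
# followed by a strided convolution that replaces pooling for downsampling.
import torch
import torch.nn as nn

class EncodingLayer3D(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.InstanceNorm3d(out_ch),
            nn.LeakyReLU(inplace=True),
        )
        # strided convolution for downsampling instead of pooling
        self.down = nn.Sequential(
            nn.Conv3d(out_ch, out_ch, kernel_size=3, stride=2, padding=1),
            nn.InstanceNorm3d(out_ch),
            nn.LeakyReLU(inplace=True),
        )

    def forward(self, x):
        return self.down(self.conv(x))

# e.g. a last encoding layer producing a feature map with 320 channels and
# spatial size 5 x 6 x 5 (batch, channels, three spatial dims)
feat = EncodingLayer3D(160, 320)(torch.randn(1, 160, 10, 12, 10))
```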

2.2. Learning topological representations based on GCA

2.2.1. Graph convolutional encoding module

Given the feature map, F, a graph, G = (V, E, X), is constructed to capture the correlations among image region nodes, where V and E are the sets of nodes and edges of the constructed graph, G, respectively, and X denotes the attribute matrix of the nodes. We first reshape F to $\hat{{\bf{F}}}\in {{\mathbb{R}}}^{{N}_{v}\times C}$, where Nv = H × W × D is the number of nodes. Each node corresponds to a local image region in the CT volume of the patient. For a graph node, vi , its attribute vector contains a set of features, [Fi1,...,FiC ], denoted as Ui , where Ui is the ith row of ${\bf{U}}\in {{\mathbb{R}}}^{{N}_{v}\times C}$. Given the attribute vectors, Ui and Uj , of nodes vi and vj , the similarity between vi and vj is sij

${s}_{{ij}}=\exp \left(-{\left\Vert {{\bf{U}}}_{i}-{{\bf{U}}}_{j}\right\Vert }_{2}\right),$   (1)

where the exponential function rescales the similarity to [0, 1], and sij is considered as the weight of the edge connecting nodes vi and vj (the closer the weight is to 1, the more similar the two nodes). The matrix ${\bf{S}}=[{s}_{{ij}}]\in {{\mathbb{R}}}^{{N}_{v}\times {N}_{v}}$ can be regarded as the adjacency matrix of all image region nodes, hence, A = S. First, let $\hat{{\bf{A}}}$ = A + I, where I is the identity matrix and $\hat{{\bf{A}}}$ is the adjacency matrix with self-loops added. Then, $\hat{{\bf{A}}}$ is normalized to $\tilde{{\bf{A}}}$ by the symmetric Laplacian normalization

$\tilde{{\bf{A}}}={\hat{D}}^{-\tfrac{1}{2}}\hat{{\bf{A}}}{\hat{D}}^{-\tfrac{1}{2}},$   (2)

where ${\hat{D}}_{{ii}}={\sum }_{j}{\hat{{\bf{A}}}}_{{ij}}$, and $\hat{D}$ is the diagonal degree matrix of $\hat{{\bf{A}}}$. The adjacency matrix, $\tilde{{\bf{A}}}$, and the feature matrix, ${{\bf{T}}}_{{enc}}^{l-1}$, output by the (l-1)th graph convolutional encoding layer are fed to the lth encoding layer to obtain ${{\bf{T}}}_{{enc}}^{l}$

${{\bf{T}}}_{{enc}}^{l}=f\left(\tilde{{\bf{A}}}{{\bf{T}}}_{{enc}}^{l-1}{{\bf{W}}}_{{enc}}^{l}\right),\quad l=1,\ldots ,{L}_{{en}},$   (3)

where f is the LeakyReLU activation function, ${{\bf{W}}}_{{enc}}^{l}$ is the weight matrix, and ${L}_{{en}}$ denotes the total number of encoding layers. The adjacency and input feature matrices of the first encoding layer are $\tilde{{\bf{A}}}$ and $\hat{{\bf{F}}}$, respectively.
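The graph construction and encoding steps in equations (1)-(3) can be sketched as follows. The exponential similarity kernel follows the reconstruction of equation (1) above and is an assumption about its exact form; the function and class names are illustrative.

```python
# Minimal sketch: pairwise similarity graph over region nodes, symmetric
# normalization with self-loops, and one GCN encoding layer.
import torch
import torch.nn.functional as F_nn

def build_normalized_adjacency(U):
    # U: (Nv, C) node attribute matrix; s_ij = exp(-||U_i - U_j||_2)
    dist = torch.cdist(U, U, p=2)
    A = torch.exp(-dist)                                  # edge weights in (0, 1]
    A_hat = A + torch.eye(A.size(0), device=A.device)     # add self-loops
    d = A_hat.sum(dim=1)
    D_inv_sqrt = torch.diag(d.pow(-0.5))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt                # symmetric normalization

class GCNLayer(torch.nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = torch.nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, A_tilde, T):
        # T_enc^l = LeakyReLU(A_tilde @ T_enc^(l-1) @ W_enc^l)
        return F_nn.leaky_relu(A_tilde @ self.W(T))

F_hat = torch.randn(150, 320)                  # Nv = 5 * 6 * 5 = 150 region nodes
A_tilde = build_normalized_adjacency(F_hat)
T1 = GCNLayer(320, 160)(A_tilde, F_hat)        # 150 x 160, as in table 1
```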

2.2.2. Encoding specific features of nodes based on CNN

The goal of the GCN encoding layers is to mutually propagate the information of graph nodes. To alleviate the over-smoothing problem, we build a CNN-based convolution module to learn the specific features of each region node. Consider the first convolutional layer as an example, with the feature map ${\bf{F}}\in {{\mathbb{R}}}^{5\times 6\times 5\times 320}$ shown in figure 2. To obtain ${{\bf{Z}}}^{1}\in {{\mathbb{R}}}^{5\times 6\times 5\times 160}$, Nconv = 160 convolutions with 1 × 1 × 1 kernels are performed on F. Let ${{\bf{F}}}_{(i,j,k)}^{l}$ denote the lth channel element of F at position (i, j, k), and let F(i,j,k) = $[{{\bf{F}}}_{(i,j,k)}^{1},...,{{\bf{F}}}_{(i,j,k)}^{C}]$ be the vector consisting of the C channel elements at the ith row, jth column, and kth depth of F. Let wt denote the tth 1 × 1 × 1 convolution kernel, whose nth value indicates the weight of the nth feature. Then, the tth feature of the region node is obtained by weighted integration

${\left({{\bf{Z}}}_{(i,j,k)}\right)}_{t}={{\bf{w}}}_{t}\ast {{\bf{F}}}_{(i,j,k)}={\sum }_{n=1}^{C}{\left({{\bf{w}}}_{t}\right)}_{n}{{\bf{F}}}_{(i,j,k)}^{n},$   (4)

where * denotes the convolutional operation. Thus, the feature vector, ${{\bf{Z}}}_{(i,j,k)}=\left[{\left({{\bf{Z}}}_{(i,j,k)}\right)}_{1},\ldots ,{\left({{\bf{Z}}}_{(i,j,k)}\right)}_{{N}_{{conv}}}\right]$, is obtained. By performing the above operation on all the image region nodes of F, the output, ${{\bf{Z}}}^{1}\in {{\mathbb{R}}}^{H\times W\times D\times {N}_{{conv}}}$, is obtained; Z1 is then reshaped to ${\hat{{\bf{Z}}}}_{1}\in {{\mathbb{R}}}^{{N}_{v}\times {N}_{{conv}}}$. Each convolutional layer learns the specific features of the image region nodes. We combine the representation learned by each convolutional layer with that of the corresponding graph convolutional layer for information propagation, so that this information is retained to the extent possible. The integrated feature matrix of the lth graph convolutional encoding layer is represented as ${\hat{{\bf{T}}}}_{{enc}}^{l}$

${\hat{{\bf{T}}}}_{{enc}}^{l}=(1-\lambda ){{\bf{T}}}_{{enc}}^{l}+\lambda {\hat{{\bf{Z}}}}_{l},$   (5)

where λ is the balance factor. In this way, the convolution module and graph convolution encoding module are connected layer by layer.
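A minimal sketch of the node information supplementation step: 1 × 1 × 1 convolutions extract the specific features of each region node (equation (4)), which are then fused with the output of the matching graph convolutional encoding layer (equation (5)). The additive fusion with the balance factor λ mirrors the reconstruction of equation (5) above and is an assumption; variable names are illustrative.

```python
# Minimal sketch of GNIS: per-node specific features from 1x1x1 convolutions,
# supplemented into the output of the corresponding GCN encoding layer.
import torch
import torch.nn as nn

H, W, D, C, N_conv, lam = 5, 6, 5, 320, 160, 0.5

F_map = torch.randn(1, C, H, W, D)             # backbone feature map F
conv1x1 = nn.Conv3d(C, N_conv, kernel_size=1)  # 160 kernels of size 1x1x1
Z1 = conv1x1(F_map)                            # (1, 160, 5, 6, 5)
Z1_hat = Z1.flatten(2).squeeze(0).t()          # reshaped to (Nv, N_conv) = (150, 160)

T_enc1 = torch.randn(H * W * D, N_conv)        # output of GCN encoding layer 1
T_enc1_hat = (1 - lam) * T_enc1 + lam * Z1_hat # supplemented features fed to the next layer
```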


Figure 2. Illustration of the process of learning the graph topology features and the specific features of image region nodes based on GCA and GNIS.


2.2.3. Graph convolutional decoding module

The last integrated feature matrix, ${\hat{{\bf{T}}}}_{{enc}}^{{L}_{{en}}}$, is sent to the first decoding layer to obtain ${{\bf{T}}}_{{dec}}^{1}$

${{\bf{T}}}_{{dec}}^{1}=f\left(\tilde{{\bf{A}}}{\hat{{\bf{T}}}}_{{enc}}^{{L}_{{en}}}{{\bf{W}}}_{{dec}}^{1}\right),$   (6)

where ${{\bf{W}}}_{{dec}}^{1}$ is the weight matrix. The output of the lth decoding layer is ${{\bf{T}}}_{{dec}}^{l}$

${{\bf{T}}}_{{dec}}^{l}=f\left(\tilde{{\bf{A}}}{{\bf{T}}}_{{dec}}^{l-1}{{\bf{W}}}_{{dec}}^{l}\right),\quad l=2,\ldots ,{L}_{{de}},$   (7)

where ${L}_{{de}}$ is the total number of decoding layers, and ${{\bf{W}}}_{{dec}}^{l}$ is the weight matrix. The more similar $\hat{{\bf{F}}}$ and ${{\bf{T}}}_{{dec}}^{{L}_{{de}}}$ are, the better the segmentation results; hence, the following reconstruction loss must be minimized

${L}_{{gca}}={\left\Vert \hat{{\bf{F}}}-{{\bf{T}}}_{{dec}}^{{L}_{{de}}}\right\Vert }_{F}^{2}.$   (8)
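The decoding layers and the reconstruction objective in equations (6)-(8) can be sketched as follows. The mean-squared form of the reconstruction loss follows the reconstruction of equation (8) above and is an assumption, and the weight matrices here are random placeholders rather than trained parameters.

```python
# Minimal sketch of the GCA decoding layers and the reconstruction loss.
import torch
import torch.nn.functional as F_nn

def gcn_layer(A_tilde, T, W):
    return F_nn.leaky_relu(A_tilde @ T @ W)

Nv = 150
A_tilde = torch.eye(Nv)                        # normalized adjacency (placeholder)
F_hat = torch.randn(Nv, 320)                   # reshaped backbone features
W_dec1, W_dec2 = torch.randn(80, 160), torch.randn(160, 320)

T_enc_last = torch.randn(Nv, 80)               # supplemented output of the last encoding layer
T_dec1 = gcn_layer(A_tilde, T_enc_last, W_dec1)   # 150 x 160 (table 1)
T_dec2 = gcn_layer(A_tilde, T_dec1, W_dec2)       # 150 x 320
L_gca = F_nn.mse_loss(T_dec2, F_hat)           # encourages T_dec to resemble F_hat
```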

2.3. Learning global feature representations based on multilayer perceptron and max pooling

Given the reshaped matrix, $\hat{{\bf{F}}}\in {{\mathbb{R}}}^{{N}_{v}\times C}$, we build M multilayer perceptron (MLP) layers and max-pooling layers to learn the global feature representation of all image region nodes. The first MLP and max-pooling layers are taken as an example to describe this process. To derive the initial dimensionality-reduced feature matrix, ${{\bf{Y}}}^{1}\in {{\mathbb{R}}}^{{N}_{v}\times {C}^{1}}$, of all image region nodes, $\hat{{\bf{F}}}$ is fed to the first MLP layer

${{\bf{Y}}}^{1}=\sigma \left({{\bf{W}}}_{{mlp}}^{1}\hat{{\bf{F}}}+{{\bf{b}}}_{{mlp}}^{1}\right),$   (9)

where σ is the LeakyReLU activation function, ${{\bf{W}}}_{{mlp}}^{1}$ is the weight matrix and ${{\bf{b}}}_{{mlp}}^{1}$ is the bias vector. Then, Y1 is fed to the max-pooling layer to obtain the global feature vector $[{h}_{{global}}^{1},\ldots ,{h}_{{global}}^{{C}^{1}}]$

${h}_{{global}}^{j}={\bf{Max}}\left({{\bf{Y}}}^{1}(1:{N}_{v},j)\right),\quad j=1,\ldots ,{C}^{1},$   (10)

where ${\bf{Max}}$ is the function that returns the maximum value. For the jth channel, the maximum value in Y1(1: Nv , j) is obtained as ${h}_{{global}}^{j}$, which is the most representative feature of the current semantics in the jth channel. Let ${{\bf{h}}}_{{global}}^{1}=[{h}_{{global}}^{1},\ldots ,{h}_{{global}}^{{C}^{1}}]$. Then, Y1 is sent to the subsequent MLP and max-pooling layers to obtain ${{\bf{h}}}_{{global}}^{2},\ldots ,{{\bf{h}}}_{{global}}^{M}$; note that ${{\bf{h}}}_{{global}}^{0}$ is obtained by max-pooling $\hat{{\bf{F}}}$ directly. After extracting these multi-dimensional global feature vectors, they are concatenated to form Y. Then, Y is copied Nv times to obtain $\hat{{\bf{Y}}}\in {{\mathbb{R}}}^{{N}_{v}\times {N}_{{gf}}}$, where Ngf = C + C1 + ⋯ + CM . We evaluated four groups of feature numbers: group A [320, 160, 80, 40], group B [320, 200, 100, 50], group C [320, 400, 200, 100], and group D [320, 400, 600, 800]; the experimental results are available in supplementary table 1 (ST1).
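A minimal sketch of this global feature extraction, using feature-number group A (320, 160, 80, 40): stacked MLP layers reduce the channel dimension, max pooling over the node axis picks the most representative value per channel, and the pooled vectors are concatenated and broadcast to all Nv nodes. The variable names are illustrative.

```python
# Minimal sketch of the MLP + max-pooling global feature module.
import torch
import torch.nn as nn

Nv, dims = 150, [320, 160, 80, 40]             # feature-number group A
F_hat = torch.randn(Nv, dims[0])

mlps = nn.ModuleList([
    nn.Sequential(nn.Linear(dims[i], dims[i + 1]), nn.LeakyReLU())
    for i in range(len(dims) - 1)
])

globals_ = [F_hat.max(dim=0).values]           # h_global^0 from F_hat itself
Y = F_hat
for mlp in mlps:
    Y = mlp(Y)                                 # Y^1: 150x160, Y^2: 150x80, Y^3: 150x40
    globals_.append(Y.max(dim=0).values)       # channel-wise max over all nodes

Y_cat = torch.cat(globals_)                    # length N_gf = 320 + 160 + 80 + 40 = 600
R_glo = Y_cat.unsqueeze(0).expand(Nv, -1)      # copied Nv times: (150, 600)
```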

Table 1. Parameter settings of each layer of GCA for learning topology features of image region nodes.

Layer | Operations | Output size
GCN encoder |  | 
Input |  | 150 × 320
Encoder_1 | ${{\bf{T}}}_{{enc}}^{1}=f(\tilde{{\bf{A}}}\hat{{\bf{F}}}{{\bf{W}}}_{{enc}}^{1})$ | 150 × 160
Encoder_2 | ${{\bf{T}}}_{{enc}}^{2}=f(\tilde{{\bf{A}}}{{\bf{T}}}_{{enc}}^{1}{{\bf{W}}}_{{enc}}^{2})$ | 150 × 80
GCN decoder |  | 
Input |  | 150 × 80
Decoder_1 | ${{\bf{T}}}_{{dec}}^{1}=f(\tilde{{\bf{A}}}{{\bf{T}}}_{{enc}}^{2}{{\bf{W}}}_{{dec}}^{1})$ | 150 × 160
Decoder_2 | ${{\bf{T}}}_{{dec}}^{2}=f(\tilde{{\bf{A}}}{{\bf{T}}}_{{dec}}^{1}{{\bf{W}}}_{{dec}}^{2})$ | 150 × 320

2.4. Multiple representations of image region node integration and optimization

The outputs of the last graph convolutional encoding layer and convolutional encoding layer are integrated to form ${\hat{{\bf{T}}}}_{{enc}}^{{L}_{{en}}}\in {{\mathbb{R}}}^{{N}_{v}\times {N}_{f}}$, where Nf denotes the number of node features. The ith row of ${\hat{{\bf{T}}}}_{{enc}}^{{L}_{{en}}}$ is regarded as the topological representation of the ith image region node, ${\hat{{\bf{T}}}}_{{enc}}^{{L}_{{en}}}$ is renamed as Rtop . The feature map, ${\bf{F}}\in {{\mathbb{R}}}^{H\times W\times D\times C}$ is reshaped to a matrix of Nv × C size and renamed as Rfea , ${({{\bf{R}}}_{{fea}})}_{i}$ is the low-dimensional feature representation of the ith image region node. Similarly, the global feature representation of the ith node is denoted as ${(\hat{{\bf{Y}}})}_{i}$, and $\hat{{\bf{Y}}}$ is renamed as Rglo . Because Rtop , Rfea , and Rglo contribute differently to the segmentation results, C 1 × 1 convolution operations are performed for them to learn adaptive weights. Let wk represent the kth 1 × 1 convolution kernel, and ${({{\bf{w}}}_{k})}_{j}$ denote the weight of the jth feature of wk . Then, the kth feature is obtained by weighted integration, as follows

${\left({{\bf{z}}}_{j}\right)}_{k}={{\bf{w}}}_{k}\ast {\left({{\bf{R}}}_{{top}}\parallel {{\bf{R}}}_{{fea}}\parallel {{\bf{R}}}_{{glo}}\right)}_{j},$   (11)

where * denotes the convolution operation, and ∥ denotes concatenation at the channel level. The integrated matrix, Z, is then obtained as ${\bf{Z}}=[{({{\bf{z}}}_{j})}_{1},{({{\bf{z}}}_{j})}_{2},\ldots ,{({{\bf{z}}}_{j})}_{C}]$ and reshaped back to H × W × D × C. Z is then fed to the nnU-Net decoder to obtain the final segmentation result. Our model is trained using Dice and cross-entropy losses. The segmentation loss, Lseg , is defined as

${L}_{{seg}}=(1-\beta ){L}_{{Dice}}+\beta {L}_{{CE}},$   (12)

where β is a factor that balances the Dice loss and the cross-entropy loss. The experimental results show that the Dice and cross-entropy losses can be combined with equal weights; the results are given in table 8. Because the parameters of nnU-Net, the graph convolutional encoder, and the graph convolutional decoder are trained together in TGSeg, the joint loss, Ljoi , is defined as

${L}_{{joi}}=(1-\alpha ){L}_{{seg}}+\alpha {L}_{{gca}},$   (13)

where α is a factor that balances the segmentation loss and the graph convolutional autoencoder-based loss and is empirically set to 0.2. In addition, we conducted ablation studies with different weights for the segmentation and graph convolutional autoencoder losses; the experimental results are given in table 9.
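A minimal sketch of this fusion and training objective: the three node representations are concatenated channel-wise and fused by C kernels of size 1 × 1 (equation (11)), and the model is trained with an equally weighted Dice and cross-entropy segmentation loss (β = 0.5, table 8) plus the graph convolutional autoencoder loss with α = 0.2. The feature sizes, the fusion layout, and the exact loss weighting follow the reconstructions above and are assumptions.

```python
# Minimal sketch of feature fusion (equation (11)) and the joint loss.
import torch
import torch.nn as nn
import torch.nn.functional as F_nn

Nv, H, W, D, C = 150, 5, 6, 5, 320
R_top = torch.randn(Nv, 80)                    # topological representation
R_fea = torch.randn(Nv, C)                     # low-dimensional feature representation
R_glo = torch.randn(Nv, 600)                   # broadcast global features

concat = torch.cat([R_top, R_fea, R_glo], dim=1)      # channel-level concatenation, (Nv, 1000)
fuse = nn.Conv1d(concat.size(1), C, kernel_size=1)    # C kernels of size 1 learn adaptive weights
Z = fuse(concat.t().unsqueeze(0)).squeeze(0).t()      # (Nv, C)
Z_vol = Z.t().reshape(1, C, H, W, D)                  # reshaped back and fed to the nnU-Net decoder

def dice_loss(prob, target, eps=1e-5):
    inter = (prob * target).sum()
    return 1.0 - (2.0 * inter + eps) / (prob.sum() + target.sum() + eps)

def joint_loss(logits, target, L_gca, beta=0.5, alpha=0.2):
    prob = torch.softmax(logits, dim=1)[:, 1]         # tumor-class probability
    L_seg = (1 - beta) * dice_loss(prob, target.float()) \
            + beta * F_nn.cross_entropy(logits, target.long())
    return (1 - alpha) * L_seg + alpha * L_gca        # joint objective (equation (13))
```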

Table 8. Investigations of different weighted Dice and cross-entropy losses on segmentation results.

β | Dice | IoU | HD (mm)
0.1 | 0.7563 | 0.6498 | 49.2359
0.3 | 0.7659 | 0.6611 | 30.3362
0.5 | 0.7696 | 0.6658 | 28.1296
0.7 | 0.7599 | 0.6532 | 41.6334

Table 9. Investigations of different weighted segmentation and graph convolutional autoencoder losses on segmentation results.

α | Dice | IoU | HD (mm)
0.2 | 0.7696 | 0.6658 | 28.1296
0.4 | 0.7583 | 0.6536 | 43.2659
0.6 | 0.7438 | 0.6409 | 65.4437
0.8 | 0.7321 | 0.6313 | 99.8654

3. Experiments

3.1. Datasets and parameter settings

The lung tumor dataset from the 2018 Medical Segmentation Decathlon challenge was used to evaluate the proposed TGSeg model. The dataset contained the CT scans of 63 lung tumor patients and the corresponding ground truth (GT). Because each CT volume had a different voxel size, all CT volumes were resampled to a voxel size of 1.24 × 0.78 × 0.78 mm3. We performed data augmentation, including horizontal and vertical mirroring, random rotation, random scaling, gamma noise, and brightness augmentation. From these 63 samples, we randomly chose 10% as the test set. The remaining 56 samples were randomly divided into 10 subsets, with 9 subsets used for training and 1 subset for validation. We collected another lung tumor CT dataset from Shandong Cancer Hospital (SDCH). This dataset contained the CT scans and corresponding GT of 227 non-small cell lung cancer (NSCLC) patients who received thoracic radiotherapy from January 2015 to June 2021. From these 227 samples, we randomly selected 20% as the test set. The remaining 181 cases were randomly divided into 5 subsets, with 4 subsets used for training and 1 for validation.

The network was implemented with PyTorch and trained on a single NVIDIA GeForce RTX 2070 graphics card with 8 GB of memory. Adam was used as the optimizer. The patch size was set to 80 × 192 × 160 and the batch size to 2. The initial learning rate was set to 0.01 and gradually decayed as the training epoch increased, following the decay function: initial learning rate × ${\left(1-\tfrac{{epoch}}{{\max }\_{epoch}}\right)}^{0.9}$.
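The polynomial learning-rate decay described above can be sketched as follows; the placeholder model and the loop structure are illustrative.

```python
# Minimal sketch of the polynomial learning-rate decay applied to Adam.
import torch

model = torch.nn.Conv3d(1, 1, 3)               # placeholder module
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
max_epoch = 1000

def poly_lr(epoch, initial_lr=0.01, exponent=0.9):
    return initial_lr * (1 - epoch / max_epoch) ** exponent

for epoch in range(max_epoch):
    for group in optimizer.param_groups:
        group["lr"] = poly_lr(epoch)
    # ... one training epoch ...
```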

For the backbone network, the kernel size of the conv3D in each encoding and decoding layer was 3 × 3 × 3, the stride was (1, 1, 1), and the padding was (1, 1, 1). The kernel size, stride, and padding of the strided convolution were 3 × 3 × 3, (2, 2, 2), and (1, 1, 1), respectively; the kernel size and stride of the transposed convolution in the decoding layers were 2 × 2 × 2 and (2, 2, 2), respectively. The detailed parameters of the proposed GCA network that learns topological representations and the MLP network that learns global feature representations are summarized in tables 1 and 2, respectively.

3.2. Performance evaluation metrics

The performance of the proposed segmentation model was evaluated using the Dice coefficient, intersection over union (IoU), and Hausdorff distance (HD), the last of which is sensitive to segmentation boundaries. The Dice value of a tumor was obtained as

${Dice}=\displaystyle \frac{2\left|{L}_{{pred}}\cap {L}_{{true}}\right|}{\left|{L}_{{pred}}\right|+\left|{L}_{{true}}\right|},$   (14)

where Lpred and Ltrue represented the segmentation result and the GT, respectively. The range of Dice was [0, 1]; the larger the value, the better the segmentation result. Similarly, IoU was defined as

${IoU}=\displaystyle \frac{\left|{L}_{{pred}}\cap {L}_{{true}}\right|}{\left|{L}_{{pred}}\cup {L}_{{true}}\right|}.$   (15)

Let A = {a1, ..., ap } and B = {b1, ..., bq } be the point sets of the segmentation result and the GT tumor, respectively. The HD between A and B was defined as

${HD}(A,B)=\max \left(h(A,B),h(B,A)\right),$   (16)

where h(A, B) represented the one-way HD from set A to set B

$h(A,B)={\max }_{a\in A}\,{\min }_{b\in B}\parallel a-b\parallel .$   (17)

Similarly, h(B, A) can be obtained. The smaller the HD value, the smaller the difference between the boundaries of the segmented and GT tumors.
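A minimal sketch of the three evaluation metrics (equations (14)-(17)), computed on binary masks and point sets with NumPy and SciPy:

```python
# Minimal sketch of Dice, IoU, and the symmetric Hausdorff distance.
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice(pred, gt):
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum())

def iou(pred, gt):
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union

def hausdorff(pred_points, gt_points):
    # symmetric HD: max of the two one-way distances h(A, B) and h(B, A)
    return max(directed_hausdorff(pred_points, gt_points)[0],
               directed_hausdorff(gt_points, pred_points)[0])
```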

3.3. Ablation experiments

Ablation experiments were conducted to assess the contribution of each main component of the proposed TGSeg model. As listed in table 3, the baseline 3D nnU-Net attained Dice, IoU, and HD values of 0.7005, 0.6007, and 148.8893 mm, respectively. When the topological information of nodes was considered through the GCA and CNN, Dice and IoU improved by 2.96% and 1.65%, and HD decreased to 81.1213 mm. When the global information of each image region node was considered, Dice and IoU increased by 4.93% and 3.3%, and HD decreased by 80.2974 mm. When both types of information were considered simultaneously, the model reached the best Dice, IoU, and HD values of 0.7696, 0.6658, and 28.1296 mm, respectively. The segmentation performance was significantly better than that of the baseline model, with Dice and IoU greater by 6.91% and 6.51% and HD smaller by 120.7597 mm.

Table 3. Results of ablation studies on the major components of our method, including topological information learning (TIL) and global information learning (GIL) of image region nodes.

TIL | GIL | Dice | IoU | HD (mm)
× | × | 0.7005 | 0.6007 | 148.8893
✓ | × | 0.7301 | 0.6172 | 81.1213
× | ✓ | 0.7498 | 0.6337 | 68.5919
✓ | ✓ | 0.7696 | 0.6658 | 28.1296

Six examples of lung tumor segmentation results are shown in figure 3. For the cases shown in figures 3(d) and (f), nnU-Net mis-segmented distant regions as tumors. The topological information enhanced the association reasoning and eliminated the false positives in both cases. For the case shown in figure 3(e), nnU-Net was barely able to locate the tumor because the tumor region had morphological and textural features similar to those of the lung region. For the cases with large tumors, shown in figures 3(a) and (b), the tumors segmented by nnU-Net leaked into the lung region. In contrast, because TGSeg captured the global information of image region nodes during training, the identification of boundaries improved.


Figure 3. Lung tumor segmentation results by the baseline nnU-Net, nnU-Net with topological information learning (TIL) based on GCA and CNN, nnU-Net with global information learning (GIL), and the final model. GT and segmented tumors are shown in red.


3.4. Experiments with different segmentation backbones

Experiments were performed by embedding the proposed TGSeg model into different segmentation backbones to demonstrate its generality. These backbones included 3D ResNet, 3D nnU-Net, and 3D U-Net. The ResNet backbone was the ResidualUNet provided by nnU-Net (https://github.com/MIC-DKFZ/nnUNet); the network structure is detailed in supplementary table 2 (ST2). As summarized in table 4, the segmentation performance of the different backbones improved significantly after the proposed model was embedded.

Table 2. Parameter settings of each layer within MLP for learning global features of image region nodes.

Layer | Operations | Output size
Input |  | 150 × 320
MLP_1 | ${{\bf{Y}}}^{1}=\sigma ({{\bf{W}}}_{{mlp}}^{1}\hat{{\bf{F}}}+{{\bf{b}}}_{{mlp}}^{1})$ | 150 × 160
MLP_2 | ${{\bf{Y}}}^{2}=\sigma ({{\bf{W}}}_{{mlp}}^{2}{{\bf{Y}}}^{1}+{{\bf{b}}}_{{mlp}}^{2})$ | 150 × 80
MLP_3 | ${{\bf{Y}}}^{3}=\sigma ({{\bf{W}}}_{{mlp}}^{3}{{\bf{Y}}}^{2}+{{\bf{b}}}_{{mlp}}^{3})$ | 150 × 40

Table 4. Segmentation performances of our models using different backbones.

Method | Dice | IoU | HD (mm)
3D U-Net | 0.6908 | 0.5935 | 113.4141
3D U-Net + ours | 0.7577 | 0.6474 | 43.9465
3D nnU-Net | 0.7005 | 0.6007 | 148.8893
3D nnU-Net + ours | 0.7696 | 0.6658 | 28.1296
3D ResNet | 0.7141 | 0.6040 | 133.7900
3D ResNet + ours | 0.7827 | 0.6981 | 32.1743

The model with 3D ResNet as the backbone exhibited the best performance, whereas the model with 3D U-Net as the backbone performed the worst. When the proposed model was embedded into these three backbones, Dice, IoU, and HD significantly improved. In particular, when nnU-Net was used as the backbone, Dice increased by 6.91% and HD decreased by 120.7597 mm. When ResNet was used as the segmentation backbone, IoU improved by 9.41%, the highest improvement. We further analyzed the results. 3D nnU-Net+ours achieved better performance than 3D U-Net+ours, possibly because 3D nnU-Net performs deep supervision on top of 3D U-Net, so that the lower layers are more fully trained. 3D ResNet+ours achieved the best performance, mainly because the residual structure of the 3D ResNet network alleviates the vanishing gradient problem caused by the network depth.

Six examples of segmentation results are shown in figure 4. When the tumor was large, as shown in figures 4(a), (b), and (f), the tumors segmented by the backbone networks leaked into the lung region; this incorrect identification was improved by our model. When a weak boundary existed between the tumor and the lung, as shown in figures 4(c) and (e), none of the backbones recognized the boundaries, and the graph reasoning in our model improved this situation.


Figure 4. Segmentation results of the proposed model with different backbones. GT and segmented tumors are shown in red.


3.5. Comparison with other methods

The TGSeg model was compared with state-of-the-art (SOTA) tumor segmentation approaches to evaluate its performance on lung tumor segmentation. These approaches included (1) SCNAS (Kim et al 2019), (2) 3D U-Net (Çiçek et al 2016), (3) 3D nnU-Net (Isensee et al 2021), (4) 3D ResNet (He et al 2016), and (5) Attention U-Net (Oktay et al 2018); the comparison results are summarized in table 5. The results show that our model outperformed the other methods regardless of which segmentation backbone was used. The best Dice value (0.7827) was obtained when using 3D ResNet as the backbone, which was 13.45%, 9.19%, 8.22%, 6.86%, and 6.29% higher than the other methods, respectively. In terms of IoU, our model with the ResNet backbone obtained the highest value of 0.6981, which was 10.46%, 9.74%, 9.41%, and 7.15% higher than 3D U-Net, 3D nnU-Net, 3D ResNet, and Attention U-Net, respectively. The best HD of 28.1296 mm was obtained by our model when using 3D nnU-Net as the backbone, which was 85.2845, 120.7597, 105.6604, and 98.0611 mm lower than the other methods, respectively. The smaller HD value indicates that our segmentation results are closer to the shape of the GT volumes. The SCNAS method used a neural architecture search framework to obtain auto-tuned structures for each layer of the encoder and decoder, an operation that complemented the encoding and decoding layers with more detailed information. However, this method did not consider the spatial connections among nodes, and its result was the worst. Our proposed graph convolution strategy propagates the spatial connections among the local regions of the lung and tumor across multiple encoding layers. Unlike a static graph, as the encoding layers gradually deepen, each encoding layer is supplemented with the specific information of the image region nodes, which helps to learn the features of image region nodes more accurately. The Attention U-Net method used U-Net as the segmentation backbone and employed attention gates to filter the propagated features in the skip connections, which can enhance useful features and suppress irrelevant ones; it achieved the second-best result. However, this method focuses more on the capture and learning of local features and ignores the global features of all nodes. Our proposed MLP-based module extracts global features, which allows learning from a global perspective and further improves the performance of the model.

The segmentation results of six cases are given in figure 5. For cases (a), (b), and (c), the three baseline models incorrectly identified normal tissues as tumors. For case (d) in figure 5, U-Net, nnU-Net, and ResNet failed to segment the complete tumor with inhomogeneous intensity distributions. These three models and Attention U-Net detected non-tumor regions as tumor in cases (d), (e), and (f). Thus, using U-Net, nnU-Net, ResNet, and Attention U-Net as segmentation methods would result in extra suspicious regions for second-round screening, missing tumor volumes, and incorrect tumor size and shape during treatment planning. Our method achieved tumor delineation results closer to manual segmentation than the other comparison methods, especially when the tumor was attached to healthy tissues and/or had inhomogeneous intensity distributions, showing greater clinical benefit.


Figure 5. Segmentation results of our model and other compared models on the public dataset. Segmented tumors are shown in red.


Table 5. Results by comparing our method with several state-of-the-art methods for segmenting lung tumors.

Method | Dice | IoU | HD (mm)
SCNAS (Kim et al 2019) | 0.6482 | – | –
3D U-Net (Çiçek et al 2016) | 0.6908 | 0.5935 | 113.4141
3D nnU-Net (Isensee et al 2021) | 0.7005 | 0.6007 | 148.8893
3D ResNet (He et al 2016) | 0.7141 | 0.6040 | 133.7900
Attention U-Net (Oktay et al 2018) | 0.7198 | 0.6266 | 126.1907
Ours (3D U-Net backbone) | 0.7577 | 0.6474 | 43.9465
Ours (3D nnU-Net backbone) | 0.7696 | 0.6658 | 28.1296
Ours (3D ResNet backbone) | 0.7827 | 0.6981 | 32.1743

We performed experiments on the second dataset to further evaluate the performance of our model and the compared models. The experimental results are given in table 6. As shown in table 6, our model obtained the best Dice of 0.7004, which outperformed 3D U-Net by 9.86%, 3D nnU-Net by 8.86%, 3D ResNet by 5.3%, and Attention U-Net by 3.99%. Moreover, our model achieved the highest IoU of 0.5704, which was 8.34%, 7.38%, 4.82%, and 2.46% greater than 3D U-Net, 3D nnU-Net, 3D ResNet, and Attention U-Net, respectively. Our model also achieved the best HD of 64.4661 mm.

Table 6. Comparative experimental results with other methods on the SDCH NSCLC dataset.

Method | Dice | IoU | HD (mm)
3D U-Net | 0.6018 | 0.4870 | 105.5996
3D nnU-Net | 0.6118 | 0.4966 | 99.3094
3D ResNet | 0.6474 | 0.5222 | 80.6390
Attention U-Net | 0.6605 | 0.5458 | 73.4885
Ours (3D U-Net backbone) | 0.6667 | 0.5463 | 71.4243
Ours (3D nnU-Net backbone) | 0.6826 | 0.5570 | 69.6529
Ours (3D ResNet backbone) | 0.7004 | 0.5704 | 64.4661

The segmentation results of six cases are given in figure 6. Overall, our proposed TIL and GIL modules consistently improved the segmentation results of the different segmentation backbones. For cases (b), (d), and (f) in figure 6, 3D U-Net, 3D nnU-Net, and 3D ResNet barely identified any lung tumor regions. Our model successfully identified the lung tumors owing to the enhanced spatial connections between image region nodes and the integrated global information, especially when 3D ResNet was used as the backbone. For cases (c) and (e) in figure 6, the 3D U-Net, 3D nnU-Net, 3D ResNet, and Attention U-Net models wrongly detected lung tissue as tumor, whereas our model successfully distinguished tumor from non-tumor regions with similar patterns. In conclusion, our model outperformed the other compared models on both datasets.


Figure 6. Segmentation results of our model and other compared models on the SDCH dataset. Segmented tumors are shown in red.


3.6. Segmentation results of tumors of different sizes

To investigate the segmentation performance on tumors of different sizes, we analyzed the 46 test cases from the SDCH NSCLC dataset. The tumor sizes ranged from 1.80 to 259.60 cm3. We divided the cases into 5 groups and calculated the average segmentation performance for each group.

As shown in table 7, overall, the larger the tumor, the better the segmentation results. When the tumor size was between 210.14 and 259.60 cm3, the segmentation results were the best, as demonstrated by the highest Dice of 0.7861. When the tumor size was smaller than 38.78 cm3, the Dice values were around 0.67. Compared with large tumors, the segmentation of small tumors is more complicated, because when the tumor is small the model has a higher chance of missing the tumor region or mistaking other lung tissue for the tumor. For example, for the case in figure 6(e), all the other methods misidentified a neighbouring non-tumor region as tumor except our models with the nnU-Net and ResNet backbones. Another potential reason is that overlap-based metrics such as Dice and IoU are affected by the segment size (Taha and Hanbury 2015); previous research has shown that for small tumors, misalignment is magnified compared to large tumors, resulting in smaller Dice and IoU values.

Table 7. Segmentation results over tumors of different sizes.

Tumor size (cm3) | Number of cases | Dice | IoU | HD (mm)
1.80–18.88 | 18 | 0.6734 | 0.5313 | 65.2677
23.04–38.78 | 6 | 0.6747 | 0.5482 | 63.7591
42.03–73.56 | 13 | 0.7014 | 0.5861 | 70.9789
116.47–194.64 | 5 | 0.7577 | 0.6248 | 55.1649
210.14–259.60 | 4 | 0.7861 | 0.6606 | 52.3790

3.7. Investigation of different weighting factors

We evaluated the segmentation performance using different combinations of the Dice loss LDice and the cross-entropy loss LCE in equation (12). With ${L}_{{seg}}=(1-\beta ){L}_{{Dice}}+\beta {L}_{{CE}}$, the experimental results using β ∈ {0.1, 0.3, 0.5, 0.7} are given in table 8.

As shown in table 8, the best result was obtained when β was 0.5, indicating that Dice and cross-entropy losses both contribute to better segmentation results and can be combined using equal weights in our studies.

We also conducted experiments on the impact factor α in equation (13) which balanced the segmentation loss Lseg and the graph convolutional autoencoder-based loss Lgca . Using α ∈ {0.2, 0.4, 0.6, 0.8}, the experimental results are given in table 9.

As shown in table 9, the segmentation performance continued to decline as α increased. This is because Lgca is an auxiliary loss that helps the graph convolutional autoencoder learn more accurate features, whereas the segmentation loss Lseg is more important. Thus, we set α to 0.2 empirically.

4. Conclusion

We presented a model that integrates the specific and global features of image region nodes for segmenting lung tumors. The constructed graph composed of all the image region nodes is beneficial for encoding the spatial dependencies among these nodes and their attributes. The graph convolutional encoding process also fuses the specific features of the image region nodes. Cross-validation confirmed the strong segmentation ability of our model, and the ablation experiments and the segmentation results with different backbones demonstrated the advantages of the major innovations and the generality of our model. Our future work includes improving the segmentation performance on smaller and more challenging NSCLC cases.

Acknowledgments

This work is supported by the Natural Science Foundation of China (61972135, 62172143, 82172865, 62201460), the STU Scientific Research Initiation Grant (NTF22032), the Natural Science Foundation of Heilongjiang Province (LH2019F049), and the China Postdoctoral Science Foundation (2019M650069, 2020M670939).

All authors contributed to the article and approved the submitted version.

Credit authorship contribution statement

Tiangang Zhang: Participated in method design and manuscript writing. Kai Wang: Designed the experiments and edited the manuscript. Hui Cui: Participated in method design and manuscript writing. Qiangguo Jin: Participated in method design. Peng Cheng: Participated in experiment design. Toshiya Nakaguchi: Participated in experiment design. Changyang Li: Participated in experiment design. Zhiyu Ning: Participated in experiment design. Linlin Wang: Provided the second dataset. Ping Xuan: Designed the method and participated in manuscript writing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.


Supplementary data (0.1 MB PDF)