Chapter 8

Data augmentation for deep ensembles in polyp segmentation


Copyright © IOP Publishing Ltd 2022
Pages 8-1 to 8-22


Abstract

Chapter 8 offers multiple data augmentation approaches for boosting segmentation performance by utilizing DeepLabv3+ as the architecture and ResNet18/ResNet50 as the backbone.


The last few years have witnessed a growing interest in semantic segmentation among computer vision researchers. Semantic segmentation in general learns low-level features and the semantics of an image via an encoder–decoder structure. In brief, the task of the encoder is to extract features by exploiting convolutional layers; the job of the decoder is to reconstruct an output of the same resolution as the input, applying skip connections to the early layers. This chapter aims to test different data augmentation approaches for boosting segmentation performance, taking DeepLabv3+ as the architecture and ResNet18/ResNet50 as the backbone. The proposed set of data augmentation approaches is coupled with an ensemble of networks obtained by randomly changing the activation functions inside the network multiple times. Moreover, the proposed approach is combined with HarDNet-MSEG, a recent architecture for semantic segmentation, for a further boost in performance.

8.1. Introduction

Semantic segmentation applies class labels to the pixels within an image containing different objects, such as people, trees, chairs and dogs. This technique has proven highly useful in many tasks, such as autonomous driving [1] and diagnosis from medical images [2]. As is the case in many fields, contemporary research in semantic segmentation has begun exploring the possibilities offered by deep learning. The architectural design of one of the earliest semantic segmentation networks (U-Net [3]) included what is now a widely used encoder–decoder structure. A drawback of U-Net is its inability to classify the borders of objects, a problem that was addressed by applying skip connections in the decoder. Most contemporary segmentation networks are now modeled on the encoder–decoder architecture [4–6] because of its success across many computer vision tasks.

In this chapter we explore semantic segmentation using DeepLabv3+ and test our system on the critical problem of colorectal cancer segmentation. Early detection of colorectal cancer is essential for good outcomes. Because there is a strong correlation between polyps and the eventual development of colorectal cancer, polyps need to be detected and removed as early as possible [7]. The detection of polyps is difficult, however, even for seasoned experts, primarily because the edges of polyps are often occluded and highly similar to the surrounding mucosa. Moreover, without being able to detect polyps, classifying them is impossible. There are five common types of polyps: (i) adenomatous, (ii) serrated, (iii) hyperplastic, (iv) tubulovillous adenoma and (v) inflammatory. Adenomas and serrated polyps are the most dangerous and challenging to detect. Automatic polyp detection could therefore augment the recognition capabilities of clinicians.

Segmentators based on traditional classifiers succeeded in matching human experts in polyp segmentation long ago [2, 8–10]. In [11] the authors compared convolutional neural networks (CNNs) with classical classifiers, demonstrating the superiority of CNNs. The power of deep learning in segmentation was illustrated by a CNN-based system proposed in [12], which placed first and second in the 2017 and 2018 Gastrointestinal Image ANAlysis (GIANA) contests. It is well known that deep learners, such as CNNs, require large datasets to generalize well, and only recently have large datasets for colorectal cancer detection become available to researchers [2, 8]. For example, Jha et al [13] recently proposed a new public polyp dataset called Kvasir-SEG that contains 1000 polyp images annotated at the pixel level by expert endoscopists at Oslo University Hospital. Jha et al [13] also trained a segmentator based on ResNet and U-Net on this novel dataset that produced some very promising results.

A very recent revolution in deep learning has been driven by transformers. The transformer was originally developed in 2017 for natural language processing [14] and has recently attracted enormous interest in computer vision as well. Like recurrent networks, it employs an encoder–decoder structure; the difference is that the input sequence can be processed in parallel. A transformer is a self-attention-based architecture, where the attention mechanism allows it to focus in 'high resolution' on a certain part of the input while the rest of the input is kept in 'low resolution'. These models usually follow a two-step training procedure: pre-training on a large dataset followed by fine-tuning on a smaller dataset specific to the application [15]. The application of transformers to computer vision tasks is based on splitting the image into patches [16], both to cope with the large number of pixels of which the image is composed and because attention is a quadratic operation. A linear transformation is then applied and position embeddings are added; the vectors obtained in this way are the input to a standard transformer encoder. The TransFuse model [17] combines CNNs and transformers: the CNN uses kernels to aggregate local information in each layer, while the transformer captures global context information. Another new method for image segmentation is UACANet [18], which combines a U-Net backbone with a parallel axial attention encoder and decoder, capturing both local and global information through a self-attention mechanism.

In this work the aim is to improve the performance of a segmentation approach based on DeepLabV3+ using ResNet18 and ResNet50 as backbones. We increase the training data size using different data augmentation approaches. These augmented training sets are fed into an ensemble of networks whose activation functions are randomly selected from a pool of ReLU variants. To increase diversity, each network is stochastically designed by varying the activation layers. A collection of the most diverse networks is selected to be trained on the augmented training sets, with the best performing networks chosen for inclusion in the final ensemble. For computational reasons, the selection procedure has been carried out only for the lighter ResNet18 architecture, while pure random generation has been performed for ResNet50. Empirical results using two different testing protocols and five datasets (Kvasir-SEG, CVC-ColonDB, EndoScene, ETIS-Larib Polyp DB and CVC-ClinicDB) show that the proposed ensemble performs better than all CNN-based models and is almost comparable with the current state-of-the-art transformer approaches.

8.2. Deep learning for semantic image segmentation

As stated in the introduction, semantic segmentation is the process whereby class labels of objects represented in an image are applied to each of the pixels composing those objects. One of the first applications of deep learning to semantic image segmentation employed a fully convolutional network (FCN), which replaces the last fully connected layers of a network with convolutional layers so that the network can classify an image at the pixel level [6]. A significant advancement in semantic segmentation was the insertion of an encoder–decoder unit [3] into the FCN architecture, which enabled a multi-layer deconvolution network to be learned. U-Net, so named because it is a U-shaped network, is another architecture that has proven successful in semantic image segmentation. The function of the encoder in U-Net is to downsample the image and increase the number of features; the task of the decoder is to increase the image resolution to match the input size [19]. Another encoder–decoder architecture that produces excellent results is SegNet [4], which uses VGG [20] as the backbone encoder; the decoder utilizes max-pooling indices from the corresponding encoder layer rather than concatenating feature maps as is the case with U-Net. As a result, SegNet requires less memory while producing better boundary reconstructions.

Another step in advancing image segmentation is the semantic segmentation model designed by Google called DeepLab [21]. This model attains dense prediction by up-sampling the output of the last convolution layer with atrous convolution and by computing a pixel-wise loss. Atrous convolution applies a dilation rate to enlarge the field of view of the filters without adding parameters that increase computational cost. The strength of the DeepLab family of segmentators [21–24] is the result of three key features: (i) dilated convolutions that avoid the decrease in resolution caused by the pooling layers and large strides, (ii) atrous spatial pyramid pooling, which uses filters with multiple sampling rates to retrieve information from the image at different scales, and (iii) a better method for localizing object boundaries that couples convolutional networks with probabilistic graphical models. DeepLabV3 enhanced previous versions of DeepLab by combining cascaded and parallel modules of dilated convolutions and by modifying the atrous spatial pyramid pooling with the addition of batch normalization and a 1 × 1 convolution. The output is given by a final layer that is also a 1 × 1 convolution; DeepLabV3 outputs the probability distribution of the classes over every pixel. Finally, DeepLabV3+ [24], the architecture used in this work, extends DeepLabV3 with a decoder module; it includes (i) point-wise convolutions that operate on the same channel but at different locations and (ii) depth-wise convolutions that operate at the same location but on different channels.

Many other architectures have also been proposed for image segmentation, with the most popular based on recurrent neural networks, attention models and generative models. For a recent survey of the literature see [25].

Aside from architectural considerations, segmentation performance is affected by other design choices, such as the selection of the pretrained backbone used for the encoder. Among the many CNN [26] architectures widely used for transfer learning, in this work we explore ResNet18 and ResNet50 [27]. These CNNs use residual blocks whose intermediate layers learn a residual function with respect to the block input.

In addition, different loss functions affect how a network is trained. For image segmentation, the most popular is pixel-wise cross-entropy loss, which treats the task as a multi-class problem where the prediction for each pixel is individually compared with the actual class label. Pixel-wise cross-entropy is calculated as the log loss summed over the classes and averaged over all pixels. Unfortunately, when the classes are unbalanced, training with this loss function can favor the most prevalent class. One method for handling this problem is to apply class weights that offset the imbalance in the dataset [6].
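
As a rough illustration of the class-weighted variant, the following MATLAB sketch (our own, with assumed variable names; not the chapter's released code) computes the weighted pixel-wise cross-entropy from a softmax volume probs, a one-hot ground truth target and a vector of class weights w.

    % Minimal sketch of class-weighted pixel-wise cross-entropy (assumed names).
    % probs  : H x W x C softmax output, target : H x W x C one-hot ground truth,
    % w      : 1 x C vector of class weights (e.g. inverse class frequencies).
    function loss = weightedPixelCrossEntropy(probs, target, w)
        eps0 = 1e-8;                                         % avoid log(0)
        [H, W, C] = size(probs);
        wMap = reshape(w, 1, 1, C);                          % broadcast weights over pixels
        ce = -sum(wMap .* target .* log(probs + eps0), 3);   % per-pixel weighted log loss
        loss = sum(ce(:)) / (H * W);                         % average over all pixels
    end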

Dice loss [28], based on the Sørensen–Dice similarity coefficient for measuring the overlap between two segmented images, is another common loss function and one that is explored here. The Dice coefficient ranges from 0 to 1, with 1 indicating perfect overlap, and the loss is obtained by subtracting it from 1. The reader is referred to [28] for an overview of other popular loss functions for image segmentation.
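
A minimal sketch of the corresponding soft Dice loss for a binary mask, again with assumed variable names rather than the chapter's actual code:

    % Minimal sketch of the soft Dice loss for a binary mask (assumed names).
    % pred : H x W predicted foreground probabilities, gt : H x W binary mask.
    function loss = diceLoss(pred, gt)
        smooth = 1;                                          % avoids division by zero
        inter  = sum(pred(:) .* gt(:));                      % soft intersection
        dice   = (2 * inter + smooth) / (sum(pred(:)) + sum(gt(:)) + smooth);
        loss   = 1 - dice;                                   % 0 = perfect overlap, 1 = none
    end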

Finally, activation functions have a significant impact on the performance of CNNs. ReLU is one of the most effective nonlinearities, but there is a growing body of literature demonstrating improved performance using ReLU variants [29]. We take advantage of these performance differences to add diversity to an ensemble of networks. Our proposed method for replacing activation layers is detailed in the next section.

8.3. Stochastic activation selection

Introduced in [29], stochastic activation selection takes a specific neural network topology and generates a diverse set of networks by randomly selecting different activation functions for each one. The random selection is iterated multiple times to generate a large number of networks with maximum diversity. After training each network on the same training set, the results are combined by the sum rule (i.e. by averaging the softmax outputs of the networks composing the ensemble).

Given DeepLabV3+ [24] as the neural network architecture, the pool of activation functions is composed of the classic ReLU [30] and some of the best ReLU variants, including Leaky ReLU [31], ELU [32], PReLU [33], S-shaped ReLU (SReLU) [34] and many others. The interested reader can find the whole set of activation functions, with a detailed mathematical explanation, in [35] (see the code on GitHub for more details).
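
The full implementation is available on GitHub; the following minimal sketch only illustrates the sum rule used to fuse the ensemble members, assuming scoreMaps holds the softmax volumes produced by the trained networks (e.g. the per-class scores returned by semanticseg for each DeepLabv3+ model).

    % Minimal sketch of the sum-rule fusion of an ensemble (assumed names).
    % scoreMaps : cell array of H x W x C softmax volumes, one per trained network.
    function labels = sumRuleFusion(scoreMaps)
        acc = zeros(size(scoreMaps{1}));
        for k = 1:numel(scoreMaps)
            acc = acc + scoreMaps{k};            % sum (equivalently average) the softmax outputs
        end
        [~, labels] = max(acc, [], 3);           % pixel-wise argmax over the classes
    end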

8.4. Data augmentation

This section describes the image augmentation techniques used in this work, designed to increase the size of the starting dataset and therefore the performance of the segmentation system. We use both shape-based and color-based transformations: in the case of shape-based transformations, the augmentation is applied to both the training images and their label masks. No test set augmentation is performed.

8.4.1. Spatial stretch

The transformation consists of stretching the original image. The stretching direction is randomly chosen from the four possible options: left/up, right/up, left/down and right/down. The stretching is applied to the columns first, then the empty spaces are interpolated with a weighted nearest neighbor method, and the same procedure is repeated for the rows. The stretching is obtained by changing the positions of the rows/columns, where s is the number of columns and rows (in this work both equal 224), t[i] is a vector containing the numbers [0, s−1], k[i] is the distance between the row/column position in the original image and in the final image, and y[j] is the final row/column position, which varies according to the original position j. The function U(a, b) returns a random number in the range [a, b]. An example of the application of the spatial stretch technique to an image of the training set and its mask is shown in figure 8.1.

Figure 8.1. Example of application of the spatial stretch technique.
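
Since the exact remapping equations are not reproduced above, the following sketch only illustrates the idea under an assumed power-law index remapping; the returned index map must be reused to stretch the ground-truth mask identically.

    % Rough sketch of a spatial stretch. The power-law index remapping below is
    % an assumption, not the chapter's formula. I is an s x s x 3 image, s = 224.
    function [J, mapIdx] = spatialStretchSketch(I)
        s = size(I, 1);
        g = 1 + 0.5 * rand;                                  % random stretch strength
        if rand < 0.5, g = 1 / g; end                        % stretch towards either side
        mapIdx = round(((0:s-1) / (s-1)).^g * (s-1)) + 1;    % monotone index remapping
        J = I(:, mapIdx, :);                                 % stretch columns (nearest neighbour)
        J = J(mapIdx, :, :);                                 % repeat for the rows
        % Reuse mapIdx in the same way to stretch the ground-truth mask.
    end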

8.4.2. Shadows

The final image is obtained by applying a shadow [36] to the left or the right of the original image, as shown in figure 8.2: the intensities of the columns are multiplied by a shading factor that attenuates one side of the image.

Figure 8.2. Example of application of the shadows technique (no changes are needed in the mask).
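
The exact attenuation curve from [36] is not reproduced here; the sketch below assumes a simple linear ramp applied to the column intensities, with the shadow falling on a randomly chosen side (the mask is unchanged).

    % Rough sketch of the shadow augmentation with an assumed linear ramp.
    function J = shadowSketch(I)
        I = im2double(I);
        s = size(I, 2);
        ramp = linspace(0.4, 1, s);                  % darker on one side of the image
        if rand < 0.5, ramp = fliplr(ramp); end      % shadow on the left or the right
        J = I .* ramp;                               % multiply the column intensities
    end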

8.4.3. Contrast and motion blur

This transformation [36] is the composite of two alterations: first, the contrast of the original image is modified (increased or decreased); then a filter that simulates the movement of the camera is applied. Two contrast functions are implemented, but only one, randomly chosen between the two, is applied to each image (a sketch is given at the end of this subsection).

The first contrast function (figure 8.3) is controlled by a parameter k.

Figure 8.3. Example of application of the contrast function 1 for decreasing values of k and subsequent use of the motion blur filter.

The parameter k controls the contrast: there is an increase in contrast if k < 0, a decrease in contrast if 0 < k ⩽ 4, and the image remains the same when k = 0.

In the code, k is chosen randomly from one of the following four ranges:

  • U(2.8, 3.8) → Hard decrease in contrast.
  • U(1.5, 2.5) → Soft decrease in contrast.
  • U(−2, −1) → Soft increase in contrast.
  • U(−5, −3) → Hard increase in contrast.

The second contrast function (figure 8.4) is controlled by a parameter α.

Figure 8.4. Example of application of the contrast function 2 for increasing values of α and subsequent use of the motion blur filter.

The parameter α controls the contrast: there is an increase in contrast if α > 1, a decrease in contrast if 0 < α < 1, and the image remains the same when α = 1. This parameter is chosen randomly from one of four possible ranges:

  • U(0.25, 0.5) → Hard decrease in contrast.
  • U(0.6, 0.9) → Soft decrease in contrast.
  • U(1.2, 1.7) → Soft increase in contrast.
  • U(1.8, 2.3) → Hard increase in contrast.
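
The two contrast formulas are not reproduced here; the sketch below assumes a simple linear contrast stretch around the image mean, followed by the motion blur filter built with fspecial.

    % Rough sketch of 'contrast and motion blur'. The linear contrast stretch is
    % an assumption standing in for the two contrast functions described above.
    function J = contrastMotionBlurSketch(I)
        I = im2double(I);
        alpha = 0.25 + 2 * rand;                             % assumed contrast factor
        m = mean(I(:));
        J = (I - m) * alpha + m;                             % increase/decrease contrast
        J = min(max(J, 0), 1);                               % clip to the valid range
        h = fspecial('motion', randi([5 15]), randi([0 180]));  % camera-motion kernel
        J = imfilter(J, h, 'replicate');                     % simulate camera movement
    end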

8.4.4. Color change and rotation

This technique consists of three color operations (color adjusting, blurring and adding Gaussian noise) and a spatial operation (image rotation). The jitterColorHSV(I, Name, Value) function is used for color adjusting, randomly changing the saturation, contrast and brightness levels within the following random ranges:

  • Saturation → [−0.3–0.1].
  • Contrast → [−0.3–0.1].
  • Brightness → [1.2 1.4].

The imgaussfilt function is used for blurring and the imnoise function for adding Gaussian noise. The image is finally rotated by an angle in the interval [−90°, 90°] (figure 8.5).

Figure 8.5. Result of the application of the 'color change and rotation' transformation.
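
A sketch of this transformation built on the toolbox functions named above; the jitter ranges below are illustrative placeholders rather than the exact values listed, and the same rotation is applied to the mask.

    % Sketch of 'color change and rotation' (placeholder jitter ranges).
    function [J, maskOut] = colorChangeRotateSketch(I, mask)
        J = jitterColorHSV(I, 'Saturation', [-0.3 -0.1], ...
                              'Contrast',   [0.7 0.9], ...
                              'Brightness', [0.1 0.3]);      % random colour jitter
        J = imgaussfilt(J, 1);                               % slight Gaussian blur
        J = imnoise(J, 'gaussian');                          % add Gaussian noise
        ang = -90 + 180 * rand;                              % angle in [-90, 90] degrees
        J = imrotate(J, ang, 'crop');                        % rotate the image
        maskOut = imrotate(mask, ang, 'crop');               % rotate the mask identically
    end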

8.4.5. Segmentation

This technique consists of the segmentation of the image based on three different colors. The image obtained in this way is divided into three images, each containing a different color. The two images with the highest number of black pixels are selected, then the third image, with the highest brightness, is added to these. The images are combined and rotated by an angle selected from the range [−90°, 90°] (figure 8.6).

Figure 8.6. Example of the segmentation technique.

8.4.6. Rand augment

This approach consists of the random selection of two transformations (one color-based and one shape-based) among a set of 21 [37].

Thirteen transformations belong to the color category:

  • Create a composite image from two images A and B, by using the function imfuse.
  • Add Gaussian noise.
  • Adjust the saturation.
  • Adjust the brightness.
  • Control the contrast.
  • Adjust the sharpness by using the function imsharpen.
  • Application of the motion filter to blur the image.
  • Histogram equalization.
  • Conversion from RGB to YUV and histogram equalization.
  • Application of the disk filter to blur the image.
  • Add salt-and-pepper noise by using the function imnoise.
  • Convert from RGB to HSV and adjust the hue by using the function jitterColorHSV(I,Name,Value).
  • Adjust the local contrast by using the function localcontrast.

Eight transformations belong to the shape category:

  • Rotation of an angle selected from the range [−40°, 40°].
  • Vertical flip.
  • Horizontal shear by using the function randomAffine2d.
  • Vertical shear by using the function randomAffine2d.
  • Horizontal translation by using the function randomAffine2d.
  • Vertical translation by using the function randomAffine2d.
  • Uniform scaling.
  • Cutout by using the function imcrop.

The Rand augment technique (figure 8.7) consists of the application of a transformation from the color category with probability 1 and the subsequent application of a technique from the shape category with a probability of 0.3.

Figure 8.7. Example of the Rand augment technique.
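
The selection logic can be sketched as follows; the function handles are placeholders standing in for the full sets of 13 color and 8 shape operations.

    % Sketch of the Rand augment selection logic: one colour transform is always
    % applied, one shape transform follows with probability 0.3.
    function J = randAugmentSketch(I)
        colorOps = {@(x) imnoise(x, 'gaussian'), ...
                    @(x) imsharpen(x), ...
                    @(x) imnoise(x, 'salt & pepper')};       % placeholders for the 13 colour ops
        shapeOps = {@(x) imrotate(x, -40 + 80 * rand, 'crop'), ...
                    @(x) flipud(x)};                         % placeholders for the 8 shape ops
        J = colorOps{randi(numel(colorOps))}(I);             % colour op, probability 1
        if rand < 0.3
            J = shapeOps{randi(numel(shapeOps))}(J);         % shape op, probability 0.3
        end
    end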

8.4.7. RICAP

The RICAP transformation [38] consists of three steps:

  • Four images are randomly selected from the training set.
  • The selected images are cropped.
  • The cropped images are patched together to construct a new image (figure 8.8).

Figure 8.8. Example of the RICAP technique.
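
A sketch of the patching step for a 224 × 224 input (the ground-truth masks would be cropped and patched in exactly the same way); the random crop positions are our assumption.

    % Sketch of RICAP: a random boundary point splits the canvas into four
    % regions, each filled with a crop from a different training image.
    % imgs is a cell array with four images of size 224 x 224 x 3.
    function J = ricapSketch(imgs)
        s = 224;                                             % input size used in this work
        w = randi([1 s-1]);  h = randi([1 s-1]);             % random boundary point
        sizes = {[h w], [h s-w], [s-h w], [s-h s-w]};        % region sizes (rows, cols)
        pos   = {[1 1], [1 w+1], [h+1 1], [h+1 w+1]};        % top-left corner of each region
        J = zeros(s, s, 3, 'like', imgs{1});
        for k = 1:4
            rh = sizes{k}(1);  rw = sizes{k}(2);
            cy = randi([1, s - rh + 1]);  cx = randi([1, s - rw + 1]);  % random crop origin
            J(pos{k}(1):pos{k}(1)+rh-1, pos{k}(2):pos{k}(2)+rw-1, :) = ...
                imgs{k}(cy:cy+rh-1, cx:cx+rw-1, :);          % patch the crop into the canvas
        end
    end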

8.4.8. Color and shape change

Ten artificial images are created (figure 8.9) for each image in the dataset [39]. The operations performed are:

  • The image is displaced to the right or the left.
  • The image is displaced up or down.
  • The image is rotated by an angle randomly selected from the range [0°, 180°].
  • Horizontal or vertical shear is applied by using the function randomAffine2d.
  • A horizontal or vertical flip is applied.
  • Change the brightness levels by adding the same value to each RGB channel.
  • Change the brightness levels by adding different values to each RGB channel.
  • Add speckle noise by using the function imnoise.
  • Application of the technique 'contrast and motion blur', described previously.
  • Application of the technique 'shadows', described previously.

Figure 8.9. Ten examples of the application of the technique 'color and shape change' to a given image.

8.4.9. Occlusion 1

Three occlusion methods are used. The first technique consists of selecting some rows (or columns) and replacing them with black lines (figure 8.10). The distance between the lines, in pixels, is chosen randomly from the interval U(15, 30).

Figure 8.10. Example of the application of the occlusion 1 technique.
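
A minimal sketch, assuming the lines are drawn either along rows or along columns with equal probability and that only the image (not the mask) is modified.

    % Sketch of occlusion 1: rows (or columns) replaced by black lines at a
    % random spacing drawn from U(15, 30) pixels.
    function J = occlusion1Sketch(I)
        J = I;
        step = randi([15 30]);                       % distance between the lines
        if rand < 0.5
            J(1:step:end, :, :) = 0;                 % horizontal black lines
        else
            J(:, 1:step:end, :) = 0;                 % vertical black lines
        end
    end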

8.4.10. Occlusion 2

In the second occlusion technique the images are transformed by replacing some parts of them with black rectangles (figure 8.11). The size of the rectangles, their number and their position are chosen randomly from the following intervals:

  • Length of the rectangle n → U(4, 14).
  • Height of the rectangle m → U(4, 14).
  • Number of rectangles → U(7, 14).
  • Coordinate x → U(1, 224 − n).
  • Coordinate y → U(1, 224 − m).

Figure 8.11. Example of the application of the occlusion 2 technique.
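
A minimal sketch using the ranges listed above, assuming a 224 × 224 input and that only the image is modified.

    % Sketch of occlusion 2: a random number of black rectangles of random size
    % and position, with the ranges given in the text (224 x 224 input).
    function J = occlusion2Sketch(I)
        J = I;
        for r = 1:randi([7 14])                      % number of rectangles
            n = randi([4 14]);  m = randi([4 14]);   % rectangle length and height
            x = randi([1 224 - n]);  y = randi([1 224 - m]);
            J(y:y+m-1, x:x+n-1, :) = 0;              % paint the rectangle black
        end
    end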

8.4.11. GridMask

Given an input image, some of its pixels are removed, as shown in figure 8.12. The equation used is y = x × M, where x is the input image, M is a binary mask whose zero entries mark the pixels to be removed, and y is the final image.

Figure 8.12. Example of the GridMask technique.
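
A sketch of the masking step; the grid period and hole size below are assumed values, since they are not specified above.

    % Sketch of GridMask: y = x .* M, where M is a binary mask with a regular
    % grid of square holes (assumed grid period and hole size).
    function J = gridMaskSketch(I)
        [H, W, ~] = size(I);
        d = randi([30 60]);  r = round(0.5 * d);     % grid period and hole size (assumed)
        M = ones(H, W);
        for i = 1:d:H
            for j = 1:d:W
                M(i:min(i+r-1, H), j:min(j+r-1, W)) = 0;   % remove a square block
            end
        end
        J = im2double(I) .* M;                       % y = x * M, applied to each channel
    end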

8.4.12. AttentiveCutMix

For each image x1 of the training set an image x2 from the original dataset is chosen randomly. The image x2 is divided into a 7 × 7 grid, and N elements are cut from this grid. The elements cut from the second image are pasted on top of the first in their original positions, as in figure 8.13. The equation used [40] is y = B × x2 + (1 − B) × x1, where y is the result of the algorithm and B is the binary mask selecting the pasted grid cells (figure 8.13).

Figure 8.13. Example of application of the AttentiveCutMix technique.
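
A sketch of the cut-and-paste step; in Attentive CutMix [40] the N cells are those most attended by a pretrained feature extractor, whereas here they are chosen at random for simplicity.

    % Sketch of the 7 x 7 cut-and-paste step of AttentiveCutMix (cells chosen at
    % random here instead of by an attention map). x1 and x2 have the same size.
    function y = attentiveCutMixSketch(x1, x2, N)
        [H, W, ~] = size(x1);
        gh = floor(H / 7);  gw = floor(W / 7);       % size of each grid cell
        y = x1;
        cells = randperm(49, N);                     % indices of the cells to paste
        for c = cells
            [gi, gj] = ind2sub([7 7], c);
            rows = (gi-1)*gh + 1 : gi*gh;
            cols = (gj-1)*gw + 1 : gj*gw;
            y(rows, cols, :) = x2(rows, cols, :);    % paste the cell from x2 onto x1
        end
    end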

8.4.13. Modified ResizeMix

For each image A of the training set, a second image B is selected randomly, resized and pasted over image A [41]. Then the technique 'color change and rotation' is applied (figure 8.14).

Figure 8.14. Example of the 'modified ResizeMix' technique.
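
A sketch of the paste step only (the subsequent 'color change and rotation' is the technique of section 8.4.4); the resize factor and paste position are assumed.

    % Sketch of the 'modified ResizeMix' paste step: image B is resized and
    % pasted over image A at a random position (assumed scale range).
    function J = resizeMixSketch(A, B)
        [H, W, ~] = size(A);
        scale = 0.2 + 0.3 * rand;                    % assumed resize factor for B
        Bs = imresize(B, [round(scale * H), round(scale * W)]);
        [h, w, ~] = size(Bs);
        y = randi([1, H - h + 1]);  x = randi([1, W - w + 1]);  % random paste position
        J = A;
        J(y:y+h-1, x:x+w-1, :) = Bs;                 % paste the resized B over A
    end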

8.4.14. Color mapping

This technique consists of randomly selecting an image B from the training set for each original image A. Three methods of color normalization of image A versus image B are applied (figure 8.15), using the Stain Normalization toolbox by Nicholas Trahearn and Adnan Khan [42]:

  • RGB histogram specification.
  • Reinhard.
  • Macenko.

Figure 8.15. Example of application of the 'color mapping' technique.

8.5. Results on colorectal cancer segmentation

8.5.1. Datasets, testing protocol and metrics

We have performed the experiments following two different protocols, which are used widely in the literature.

The first testing protocol [13, 43] uses the Kvasir-SEG dataset [44] partitioned into 880 images for training and the remaining 120 for testing. The Kvasir-SEG dataset includes 1000 polyp images acquired using a high-resolution electromagnetic imaging system (the image sizes vary between 332 × 487 and 1920 × 1072 pixels), with a ground-truth consisting of bounding boxes and segmentation masks.

The second testing protocol exploits five polyp datasets, namely Kvasir-SEG, CVC-ColonDB, EndoScene, ETIS-Larib Polyp DB and CVC-ClinicDB. The CVC-ClinicDB [45] and ETIS-Larib [46] datasets consist of frames extracted from colonoscopy videos, annotated by expert video endoscopists, and include 612 images (384 × 288) and 196 high-resolution images (1225 × 966), respectively. The CVC-ColonDB [47] dataset contains 380 images (574 × 500) representing 15 different polyps. EndoScene is a combination of CVC-ClinicDB and CVC300. The training set for the second testing protocol includes 900 images from Kvasir-SEG and 550 images from CVC-ClinicDB. The testing set is made up of the remaining images from the above-cited datasets: 100 images from Kvasir-SEG, 62 images from CVC-ClinicDB, 380 from CVC-ColonDB, 196 from ETIS-Larib and 60 from CVC-T, which is the testing set for EndoScene. Note that only a small set of images is taken from EndoScene to avoid the inclusion of images already seen in the training stage.

All the datasets for both protocols can be downloaded from the GitHub of [43].

In all experiments, even when the images are resized to the input size of a CNN model, the predicted masks are resized back to the original dimensions before performance evaluation. We do not include in the comparison other approaches that evaluated performance on the resized version of the images.

In accordance with most of the works on polyp segmentation we use pixel-wise metrics as performance indicators: accuracy, precision, recall, F1-score, F2-score, intersection over union (IoU) and Dice. A mathematical definition of each indicator for a bi-class problem (foreground/background), starting from the confusion matrix (TP, TN, FP and FN refer to the true positives, true negatives, false positives and false negatives, respectively), is the following:

Accuracy = (TP + TN)/(TP + TN + FP + FN), which is the number of pixels correctly classified over the total number of pixels in the image;

Recall = TP/(TP + FN), which is the fraction of the polyp pixels that are correctly classified;

Precision = TP/(TP + FP), which is the fraction of the predicted mask that actually consists of polyp pixels;

F1 = 2 · Precision · Recall/(Precision + Recall) and F2 = 5 · Precision · Recall/(4 · Precision + Recall), which are two measures that average precision and recall;

IoU = |A ∩ B|/|A ∪ B|, which is defined as the area of intersection between the predicted mask A and the ground-truth mask B divided by the area of the union of the two masks; and

Dice = 2|A ∩ B|/(|A| + |B|), which is defined as twice the overlap area of the predicted and ground-truth masks divided by the total number of pixels in the two masks (it coincides with the F1-score for binary masks).

All the above metrics range in [0,1] and must be maximized. The final performance is obtained by averaging on the test set the performance obtained for each test image.
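
For completeness, the per-image metrics can be computed directly from the confusion-matrix entries of a predicted binary mask, as in the following sketch; the per-image values are then averaged over the test set.

    % Sketch of the per-image metrics computed from a predicted binary mask and
    % its ground truth (both logical H x W arrays).
    function m = segMetricsSketch(pred, gt)
        TP = sum(pred(:) &  gt(:));   FP = sum(pred(:) & ~gt(:));
        FN = sum(~pred(:) & gt(:));   TN = sum(~pred(:) & ~gt(:));
        m.accuracy  = (TP + TN) / (TP + TN + FP + FN);
        m.recall    = TP / (TP + FN);
        m.precision = TP / (TP + FP);
        m.F1        = 2 * m.precision * m.recall / (m.precision + m.recall);
        m.F2        = 5 * m.precision * m.recall / (4 * m.precision + m.recall);
        m.IoU       = TP / (TP + FP + FN);
        m.Dice      = 2 * TP / (2 * TP + FP + FN);   % equals F1 for binary masks
    end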

8.5.2. Experiments

The first experiment in table 8.1 is aimed at comparing the two different backbone networks: ResNet18 and ResNet50. Since the size of images in the Kvasir dataset is quite large we also evaluate versions of the ResNet with larger input size, i.e. 299 × 299 (ResNet18–299/ResNet50–299) and 352 × 352 (ResNet18–352/ResNet50–352). Clearly, a larger input size improves the performance.

Table 8.1.  Experiments with different backbones (first protocol).

Backbone IoU Dice F2 Prec. Rec. Acc.
ResNet18 0.759 0.844 0.845 0.882 0.856 0.952
ResNet50 0.751 0.837 0.836 0.883 0.845 0.952
ResNet18–299 0.782 0.863 0.870 0.881 0.883 0.959
ResNet50–299 0.798 0.872 0.876 0.898 0.886 0.962
ResNet18–352 0.787 0.865 0.871 0.891 0.884 0.960
ResNet50–352 0.801 0.872 0.884 0.881 0.900 0.964

The training of all the models has been performed with the SGD optimizer for 20 epochs, with an initial learning rate of 10e-2, a learning rate drop period of 5 epochs and a drop factor of 0.2. In this experiment we use the same 'base' data augmentation as in our previous paper [48], i.e. horizontal and vertical flips and 90° rotation.

The second experiment (table 8.2) is aimed at comparing the different data augmentation approaches proposed in this work. For the sake of computation time all the tests are performed using ResNet18 with an input size of 224 and compared to the 'base' data augmentation reported in the first row of tables 8.1 and 8.2. The other rows of table 8.2 include the performance obtained using the different data augmentation approaches described in section 8.4, and the last three rows report the performance obtained by evaluating a combination of several data augmentation approaches.

Table 8.2.  Performance of the different data augmentation approaches (first protocol).

Data augmentation IoU Dice F2 Prec. Rec. Acc.
ResNet18 0.759 0.844 0.845 0.882 0.856 0.952
Spatial stretch 0.732 0.822 0.824 0.868 0.838 0.948
Shadows 0.748 0.836 0.842 0.868 0.858 0.952
Contrast and motion blur 0.749 0.836 0.843 0.857 0.860 0.952
Color change and rotation 0.761 0.849 0.852 0.874 0.863 0.953
Segmentation 0.758 0.846 0.861 0.854 0.883 0.953
Rand augment 0.754 0.841 0.846 0.870 0.858 0.952
RICAP 0.745 0.834 0.835 0.876 0.847 0.949
Color and shape change 0.803 0.874 0.883 0.885 0.900 0.961
Occlusions 1 0.747 0.834 0.846 0.855 0.867 0.954
Occlusions 2 0.737 0.825 0.830 0.852 0.847 0.951
GridMask 0.746 0.836 0.841 0.869 0.855 0.949
AttentiveCutMix 0.735 0.826 0.830 0.869 0.843 0.945
Modified ResizeMix 0.757 0.846 0.846 0.882 0.855 0.952
Color mapping 0.760 0.841 0.854 0.853 0.877 0.954
Comb_1 0.802 0.875 0.886 0.886 0.904 0.959
Comb_2 0.805 0.875 0.882 0.891 0.893 0.962
Comb_3 0.799 0.870 0.877 0.894 0.889 0.962

Comb_1 → 'color and shape change' and 'color mapping';

Comb_2 → 'color and shape change' and 'color change and rotation';

Comb_3 → 'color and shape change' and 'color mapping' and 'color change and rotation'.

Clearly, the combination of different data augmentation approaches boosts the performance.

The third experiment (table 8.3) is aimed at designing effective ensembles by varying the activation functions. Each ensemble is the fusion by the sum rule of 14 models (since 14 activation functions are used in [48]). For the sake of computation time, this test is performed only on the Kvasir-SEG dataset augmented by the same method as in table 8.1 (base data augmentation, first testing protocol). The ensemble name is the concatenation of the name of the backbone network and a string identifying the creation approach:

  • act: each of the 14 networks of the ensemble is obtained by substituting its activation layers with one of the activation functions used in [48] (the same function for all the layers, but a different function for each network).
  • sto: 14 stochastic models are generated, replacing each activation layer with a randomly selected activation function from those used in [48] (which may be different for each layer).
  • sel: ensembles of 'selected' stochastic models. The network selection is performed using three-fold cross-validation on the training set among 100 stochastic models. The selection procedure is aimed at picking the most performing/independent classifiers to be added to the ensemble. For a fair comparison with the other ensembles, we selected a set of 14 networks, which are finally fine-tuned on the whole augmented training set at a larger resolution.
  • relu: 14 networks with the standard architecture. All the starting models in the ensemble are the same, except for the initialization.

Table 8.3.  Experiments on ensembles (first protocol).

Ensemble name IoU Dice F2 Prec. Rec. Acc.
ResNet18_act 0.774 0.856 0.856 0.888 0.867 0.955
ResNet18_relu 0.774 0.858 0.858 0.892 0.867 0.955
ResNet18_sto 0.780 0.860 0.857 0.898 0.864 0.956
ResNet50_act 0.779 0.858 0.859 0.894 0.869 0.957
ResNet50_relu 0.772 0.855 0.858 0.889 0.870 0.955
ResNet50_sto 0.779 0.859 0.864 0.891 0.877 0.957
ResNet50–352_sto 0.820 0.885 0.888 0.915 0.896 0.966
ResNet50–352_sel 0.825 0.888 0.892 0.915 0.902 0.967

According to the results reported in table 8.3, it is clear that the stochastic variation of activation functions (sto) yields a performance improvement with respect to a simple fusion of networks sharing the same architecture (relu) or a set of networks differing only in the activation function (act). A further improvement is obtained when a selection procedure (sel) is used to include only the best performing architectures. Unfortunately, the selection process is computationally heavy, because it is performed over a pool of 100 networks. The best performance among the ensembles is obtained by ResNet50–352_sel, showing that the input size is critical in this problem.

Finally, in tables 8.4 and 8.5 a comparison with some state-of-the-art results is reported according to the above-cited first and second testing protocols, respectively.

In these final experiments, for the sake of computation time, ResNet50 is trained using only the base data augmentation consisting of horizontal and vertical flips and 90° rotation, while ResNet18 is trained using the data augmentation approaches proposed here, i.e. 'color and shape change' and 'color mapping'.

The methods named HarDNet_SGD and HarDNet_Adam are our experiments using HarDNet-MSEG [43] trained with the SGD and Adam optimizers, respectively (using the code shared by the authors). The reason for this choice is that the authors of [43] proposed a different optimizer for the two protocols.

In table 8.4, in addition to the methods presented in the above tables, we also report the performance of the following fusions by average rule (symbol ⊕):

  • ResNet18&50 = ResNet18_352_sel ⊕ ResNet50_352_sel.
  • HarDNet_SGD/Adam ⊕ ResNet18&50, which is the fusion of HarDNet (SGD or Adam version) with the ensemble above.
  • HarDNet_SGD ⊕ HarDNet_Adam ⊕ ResNet18&50, which is the fusion of both HarDNet versions (SGD and Adam) with our ensemble.

Table 8.4.  State-of-the-art approaches using the first testing protocol.

Method IoU Dice F2 Prec. Rec. Acc.
ResNet18_352_sel 0.819 0.886 0.890 0.910 0.899 0.966
ResNet50_352_sel 0.823 0.887 0.890 0.916 0.899 0.966
ResNet18&50 0.831 0.893 0.896 0.920 0.904 0.967
HarDNet_SGD 0.847 0.906 0.914 0.916 0.924 0.971
HarDNet_Adam 0.829 0.888 0.893 0.916 0.904 0.963
HarDNet_SGD⊕ResNet18&50 0.851 0.908 0.914 0.922 0.922 0.972
HarDNet_Adam⊕ResNet18&50 0.834 0.892 0.896 0.919 0.905 0.964
HarDNet_SGD⊕HarDNet_Adam⊕ResNet18&50 0.846 0.902 0.906 0.924 0.914 0.970
U-Net [13] 0.471 0.597 0.598 0.672 0.617 0.894
ResUNet [13] 0.572 0.69 0.699 0.745 0.725 0.917
ResUNet++ [13] 0.613 0.714 0.72 0.784 0.742 0.917
FCN8 [13] 0.737 0.831 0.825 0.882 0.835 0.952
HRNet [13] 0.759 0.845 0.847 0.878 0.859 0.952
DoubleUNet [13] 0.733 0.813 0.82 0.861 0.84 0.949
PSPNet [13] 0.744 0.841 0.831 0.890 0.836 0.953
DeepLabV3+ResNet50 [13] 0.776 0.857 0.855 0.891 0.861 0.961
DeepLabV3+ResNet101 [13] 0.786 0.864 0.857 0.906 0.859 0.961
U-Net ResNet34 [13] 0.810 0.876 0.862 0.944 0.86 0.968
ColonSegNet [13] 0.724 0.821 0.821 0.843 0.850 0.949
DDANet [49] 0.78 0.858 0.864 0.888
HarDNet-MSEG [43] 0.848 0.904 0.915 0.907 0.923 0.969

As far as the latter protocol is concerned, since the training set is larger, to reduce the computation time we have run the following methods without supervised selection:

  • ResNet18_352, a stand-alone ResNet18 trained using the base data augmentation approach (only horizontal/vertical flip and 90° rotation).
  • ResNet18_352_a, a stand-alone ResNet18 trained using the data augmentation 'color and shape change' and 'color mapping'.
  • ResNet50_352, a stand-alone ResNet50 trained using base data augmentation.
  • ResNet18_352_relu_a, an ensemble of ten standard ResNet18 networks trained using the data augmentation 'color and shape change' and 'color mapping'.
  • ResNet50_352_relu, an ensemble of ten standard ResNet50 networks trained using base data augmentation.
  • ResNet18_352_sto_a, an ensemble of ten stochastic ResNet18 networks trained using the data augmentation 'color and shape change' and 'color mapping'.
  • ResNet50_352_sto, an ensemble of ten stochastic ResNet50 networks trained using base data augmentation.

Moreover, in table 8.5 the performance of the following fusions by the average rule is reported:

  • Res18&50sto = ResNet18_352_sto_a ⊕ ResNet50_352_sto.
  • HN1&R = HarDNet_Adam ⊕ Res18&50sto, which is the fusion of HarDNet_Adam with the ensemble above.
  • HN2&R = HarDNet_SGD ⊕ HarDNet_Adam ⊕ Res18&50sto, which is the fusion of both HarDNet versions (SGD and Adam) with our ensemble.

Table 8.5.  State-of-the-art approaches using the second testing protocol. For each test set the number of images is given in parentheses. The last three columns report the average IoU, the average Dice and the rank (on IoU).

Method Kvasir (100) ClinicDB (62) ColonDB (380) ETIS (196) CVC-T (60) Average
IoU Dice IoU Dice IoU Dice IoU Dice IoU Dice IoU Dice Rank
ResNet18_352 0.832 0.893 0.842 0.89 0.646 0.721 0.582 0.652 0.796 0.871 0.740 0.805 15
ResNet18_352_a 0.82 0.888 0.861 0.914 0.653 0.74 0.555 0.631 0.772 0.844 0.732 0.803 18
ResNet50_352 0.839 0.895 0.845 0.898 0.637 0.716 0.523 0.591 0.819 0.892 0.733 0.798 17
ResNet18_352_relu_a 0.835 0.898 0.876 0.924 0.685 0.768 0.591 0.66 0.783 0.857 0.754 0.821 10
ResNet50_352_relu 0.85 0.904 0.865 0.911 0.635 0.712 0.554 0.615 0.821 0.891 0.745 0.807 14
ResNet18_352_sto_a 0.834 0.896 0.871 0.921 0.685 0.763 0.573 0.642 0.785 0.851 0.750 0.815 12
ResNet50_352_sto 0.853 0.909 0.859 0.906 0.655 0.729 0.561 0.617 0.82 0.891 0.750 0.810 13
Res18&50sto 0.849 0.906 0.877 0.924 0.678 0.75 0.586 0.647 0.813 0.882 0.761 0.822 9
HarDNet_SGD 0.857 0.908 0.864 0.911 0.677 0.752 0.562 0.639 0.799 0.868 0.752 0.816 11
HarDNet_Adam 0.854 0.906 0.875 0.924 0.678 0.751 0.625 0.716 0.831 0.903 0.773 0.840 7
HN1&R 0.861 0.912 0.882 0.929 0.684 0.757 0.644 0.727 0.833 0.904 0.781 0.846 5
HN2&R 0.87 0.918 0.884 0.929 0.695 0.768 0.635 0.710 0.830 0.900 0.783 0.845 4
HarDNet-MSEG [43] 0.857 0.912 0.882 0.932 0.66 0.731 0.613 0.677 0.821 0.887 0.767 0.828 8
PraNet [43] 0.84 0.898 0.849 0.899 0.64 0.709 0.567 0.628 0.797 0.871 0.739 0.801 16
SFA [43] 0.611 0.723 0.607 0.700 0.347 0.469 0.217 0.297 0.329 0.467 0.422 0.531 21
U-Net++ [43] 0.743 0.821 0.729 0.794 0.41 0.483 0.344 0.401 0.624 0.707 0.570 0.641 20
U-Net [43] 0.746 0.818 0.755 0.823 0.444 0.512 0.335 0.398 0.627 0.710 0.581 0.652 19
SETR [50] 0.854 0.911 0.885 0.934 0.69 0.773 0.646 0.726 0.814 0.889 0.778 0.847 6
TransUnet [51] 0.857 0.913 0.887 0.935 0.699 0.781 0.66 0.731 0.824 0.893 0.785 0.851 3
TransFuse [17] 0.870 0.92 0.897 0.942 0.706 0.781 0.663 0.737 0.826 0.894 0.792 0.855 1
UACANet [18] 0.859 0.912 0.88 0.926 0.678 0.751 0.678 0.751 0.849 0.910 0.789 0.850 2

Finally, for the sake of comparison in table 8.5, we report the performance of several state-of-the-art approaches: only methods following the second protocol exactly are included.

The following conclusions can be obtained from tables 8.4 and 8.5:

  • From the comparison of ResNet18_352 (or ResNet18_352_a) and the ensembles ResNet18_352_relu_a (or ResNet18_352_sto_a) we can observe the advantage of fusing networks together.
  • While in table 8.2 data augmentation granted a performance improvement for ResNet18, the results in table 8.5 suggest that data augmentation is not helpful there. The reason may be the different input sizes of the networks: by using larger input layers the performance improves, reducing the need for data augmentation.
  • The fusion of ensembles created from different models (ResNet18 and ResNet50) allows a performance improvement over each component.
  • The best method for training the HarDNet model is Adam (considering all datasets from the second protocol).
  • HarDNet_Adam has a slight advantage over our proposed ensembles (ResNet18&50 or Res18&50sto); nonetheless, the fusion of the two approaches improves on both.

This work shows the usefulness of ensembles. Our best method is HN2&R, which is given by the fusion of HarDNet_SGD, HarDNet_Adam and the ensemble Res18&50sto. This method closes the gap with transformer-based approaches, reaching almost comparable performance.

8.6. Conclusion

Semantic segmentation is a very important topic in medical image analysis. In this chapter our goal is to improve the performance of polyp segmentation during colonoscopy tests.

We tested several data augmentation approaches, then proposed combining a set of them to increase the size of the training set. Finally, we used the augmented training set to feed an ensemble of networks in which the activation functions are randomly substituted. Very recent methods based on transformers obtain the best segmentation performance on this problem. In any case, some of the ideas proposed in this chapter (i.e. stochastic ensemble generation and selection, advanced data augmentation) could be coupled advantageously with transformer-based approaches.

In future work we plan to study the feasibility of reducing the complexity of our ensemble by applying some techniques such as pruning, quantization, low-rank factorization and distillation.

The MATLAB code of all the descriptors and experiments reported in this paper will be available at https://github.com/LorisNanni.

References

  • [1]Feng D, Haase-Schütz C, Rosenbaum L, Hertlein H, Glaeser C, Timm F, Wiesbeck W and Dietmayer K 2020 Deep multi-modal object detection and semantic segmentation for autonomous driving: datasets, methods, and challenges IEEE Trans. Intell. Transp. Syst. 22 1341–60
  • [2]Brandao P et al 2018 Towards a computed-aided diagnosis system in colonoscopy: automatic polyp segmentation using convolution neural networks J. Med. Robot. Res. 1840002
  • [3]Noh H, Hong S and Han B 2015 Learning deconvolution network for semantic segmentation pp 1520–8
  • [4]Badrinarayanan V, Kendall A and Cipolla R 2017 SegNet: a deep convolutional encoder–decoder architecture for image segmentation IEEE Trans. Pattern Anal. Mach. Intell. 39 2481–95
  • [5]Bullock J, Cuesta-Lázaro C and Quera-Bofarull A 2019 XNet: a convolutional neural network (CNN) implementation for medical x-ray image segmentation suitable for small datasets Proc. SPIE 10953 109531Z
  • [6]Shelhamer E, Long J and Darrell T 2015 Fully convolutional networks for semantic segmentation IEEE Trans. Pattern Anal. Mach. Intell. 39 640–51
  • [7]Roncucci L and Mariani F 2015 Prevention of colorectal cancer: how many tools do we have in our basket? Eur. J. Intern. Med. 26 752–6
  • [8]Wang Y, Tavanapong W, Wong J, Oh J and De Groen P C 2013 Part-based multiderivative edge cross-sectional profiles for polyp detection in colonoscopy IEEE J. Biomed. Heal. Informatics 18 1379–89
  • [9]Mori Y, Kudo S, Berzin T M, Misawa M and Takeda K 2017 Computer-aided diagnosis for colonoscopy Endoscopy 49 813
  • [10]Wang P et al 2018 Development and validation of a deep-learning algorithm for the detection of polyps during colonoscopy Nat. Biomed. Eng. 741–8
  • [11]Thambawita V, Jha D, Riegler M, Halvorsen P, Hammer H L, Johansen H D and Johansen D 2018 The Medico-Task 2018: disease detection in the gastrointestinal tract using global features and deep learning arXiv:1810.13278  
  • [12]Guo Y B and Matuszewski B 2019 GIANA polyp segmentation with fully convolutional dilation neural networks pp 632–41
  • [13]Jha D, Ali S, Johansen H D, Johansen D, Rittscher J, Riegler M A and Halvorsen P 2020 Real-time polyp detection, localisation and segmentation in colonoscopy using deep learning arXiv:2011.07631  
  • [14]Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł and Polosukhin I 2017 Attention is all you need pp 5998–608
  • [15]Khan S, Naseer M, Hayat M, Zamir S W, Khan F S and Shah M 2021 Transformers in vision: a survey arXiv:2101.01169  
  • [16]Dosovitskiy A et al 2021 An image is worth 16 × 16 words: transformers for image recognition at scale arXiv:2010.11929  
  • [17]Zhang Y, Liu H and Hu Q 2021 TransFuse: fusing transformers and CNNs for medical image segmentation arXiv:2102.08005  
  • [18]Kim T, Lee H and Kim D 2021 UACANet: uncertainty augmented context attention for polyp segmentation arXiv:2107.02368  
  • [19]Ronneberger O, Fischer P and Brox T 2015 U-net: convolutional networks for biomedical image segmentation International Conference on Medical Image Computing and Computer-Assisted Intervention ( (Lecture Notes in Computer Science vol 9351) )  (Berlin: Springer)  pp 234–41
  • [20]Simonyan K and Zisserman A 2015 Very deep convolutional networks for large-scale image recognition pp 1–14
  • [21]Chen L C, Papandreou G, Kokkinos I, Murphy K and Yuille A L 2018 DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs IEEE Trans. Pattern Anal. Mach. Intell. 40 834–48
  • [22]Chen L-C, Papandreou G, Kokkinos I, Murphy K and Yuille A L 2014 Semantic image segmentation with deep convolutional nets and fully connected CRFS arXiv:1412.7062  
  • [23]Chen L-C, Papandreou G, Schroff F and Adam H 2017 Rethinking atrous convolution for semantic image segmentation arXiv:1706.05587  
  • [24]Chen L C, Zhu Y, Papandreou G, Schroff F and Adam H 2018 Encoder–decoder with atrous separable convolution for semantic image segmentation European Conference on Computer Vision ( (Lecture Notes in Computer Science vol 11211) )  (Berlin: Springer)  pp 833–51
  • [25]Minaee S, Boykov Y, Porikli F, Plaza A, Kehtarnavaz N and Terzopoulos D 2020 Image segmentation using deep learning: a survey arXiv:2001.05566  
  • [26]Khan A, Sohail A, Zahoora U and Qureshi A S 2020 A survey of the recent architectures of deep convolutional neural networks Artif. Intell. Rev. 53 1–87
  • [27]He K, Zhang X, Ren S and Sun J 2016 Deep residual learning for image recognition pp 770–8
  • [28]Jadon S 2020 A survey of loss functions for semantic segmentation pp 1–7
  • [29]Nanni L, Lumini A, Ghidoni S and Maguolo G 2020 Stochastic selection of activation layers for convolutional neural networks Sensors 20 1626
  • [30]Glorot X, Bordes A and Bengio Y 2011 Deep sparse rectifier neural networks J. Mach. Learn. Res. 15 315–23
  • [31]Maas A L, Hannun A Y and Ng A Y 2013 Rectifier nonlinearities improve neural network acoustic models pp 16–21
  • [32]Clevert D A, Unterthiner T and Hochreiter S 2016 Fast and accurate deep network learning by exponential linear units (ELUs) arXiv:1511.07289  
  • [33]He K, Zhang X, Ren S and Sun J 2015 Delving deep into rectifiers: surpassing human-level performance on ImageNet classification pp 1026–34
  • [34]Jin X, Xu C, Feng J, Wei Y, Xiong J and Yan S 2016 Deep learning with S-shaped rectified linear activation units pp 1737–43
  • [35]Nanni L, Maguolo G, Brahnam S and Paci M 2021 Comparison of different convolutional neural network activation functions and methods for building ensembles arXiv:2103.15898 
  • [36]Varkarakis V, Bazrafkan S and Corcoran P 2020 Deep neural network and data augmentation methodology for off-axis iris segmentation in wearable headsets Neural Netw. 121 101–21
  • [37]Yao P, Shen S, Xu M, Liu P, Zhang F, Xing J, Shao P, Kaffenberger B and Xu R X 2021 Single model deep learning on imbalanced small datasets for skin lesion classification arXiv:2102.01284  
  • [38]Chen P, Liu S, Zhao H and Jia J 2020 GridMask data augmentation arXiv:2001.04086 
  • [39]Sánchez-Peralta L F, Picón A, Sánchez-Margallo F M and Pagador J B 2020 Unravelling the effect of data augmentation transformations in polyp segmentation Int. J. Comput. Assist. Radiol. Surg. 15 1975–88
  • [40]Walawalkar D, Shen Z, Liu Z and Savvides M 2020 Attentive CutMix: an enhanced data augmentation approach for deep learning based image classification arXiv:2003.13048  
  • [41]Qin J, Fang J, Zhang Q, Liu W, Wang X and Wang X 2020 ResizeMix: mixing data with preserved object information and true labels arXiv:2012.11101  
  • [42]Khan A M, Rajpoot N, Treanor D and Magee D 2014 A nonlinear mapping approach to stain normalization in digital histopathology images using image-specific color deconvolution IEEE Trans. Biomed. Eng. 61 1729–38
  • [43]Huang C-H, Wu H-Y and Lin Y-L 2021 HarDNet-MSEG: a simple encoder-decoder polyp segmentation neural network that achieves over 0.9 mean Dice and 86 FPS arXiv:2101.07172 
  • [44]Jha D, Smedsrud P H, Riegler M A, Halvorsen P, de Lange T, Johansen D and Johansen H D 2020 Kvasir-SEG: a segmented polyp dataset MultiMedia Modeling (Lecture Notes in Computer Science)  (Berlin: Springer) 
  • [45]Bernal J, Sánchez F J, Fernández-Esparrach G, Gil D, Rodríguez C and Vilariño F 2015 WM-DOVA maps for accurate polyp highlighting in colonoscopy: validation vs saliency maps from physicians Comput. Med. Imaging Graph. 43 99–111
  • [46]Silva J, Histace A, Romain O, Dray X and Granado B 2014 Toward embedded detection of polyps in WCE images for early diagnosis of colorectal cancer Int. J. Comput. Assist. Radiol. Surg. 283–93
  • [47]Bernal J, Sánchez J and Vilariño F 2012 Towards automatic polyp detection with a polyp appearance model Pattern Recognit. 45 3166–82
  • [48]Lumini A, Nanni L and Maguolo G 2021 Deep ensembles based on stochastic activations for semantic segmentation Signals 820–33
  • [49]Tomar N K, Jha D, Ali S, Johansen H D, Johansen D, Riegler M A and Halvorsen P 2021 DDANet: dual decoder attention network for automatic polyp segmentation arXiv:2012.15245  
  • [50]Zheng S et al 2020 Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers arXiv:2012.15840  
  • [51]Chen J, Lu Y, Yu Q, Luo X, Adeli E, Wang Y, Lu L, Yuille A L and Zhou Y 2021 TransUNet: transformers make strong encoders for medical image segmentation arXiv:2102.04306  
