Efficient and Lightweight Spatial Attention-Based Shallow CNNs with Superior Performance for Concrete Surface Crack Detection

To address the high training cost and low time efficiency of concrete surface crack detection, this paper proposes a detection method based on shallow CNNs. First, an image dataset of concrete surface cracks is constructed and preprocessed. Then, a shallow CNN model for concrete surface crack detection is built, its hyperparameters are set, a spatial attention mechanism is introduced, and the dataset is fed to the model for training. Finally, the results are analyzed, evaluated, and compared with the mainstream deep learning models InceptionV3 and ResNet50. Experimental results show that the proposed method achieves 99.56% accuracy in detecting cracks on concrete surfaces while offering higher time efficiency and lower training cost. Given the significantly higher training costs of mainstream deep learning models such as InceptionV3 and ResNet50, the proposed method holds promise for future applications in industrial production and intelligent construction quality inspection. By automatically and accurately identifying cracks on concrete surfaces, it can help reduce the risk of catastrophic failures.


Introduction
Concrete is a widely used building material that forms the foundation of various structures, including buildings, bridges, dams, and other infrastructure. Ensuring the integrity and safety of these structures is of utmost importance. Over time, concrete may develop cracks due to loading conditions, environmental exposure, material degradation, and construction errors. Early detection and monitoring of these cracks are essential for maintaining the structural health of infrastructure, preventing catastrophic failure, and ensuring public safety. Traditional concrete crack detection methods, such as visual inspection and non-destructive testing techniques (e.g., ultrasonic testing, radiography), can be labor-intensive and time-consuming, and often require expert knowledge to interpret the results accurately. Furthermore, these methods may lack the sensitivity required to detect small or concealed cracks, potentially leading to overlooked structural issues.
In recent years, deep learning models have been widely applied to defect detection tasks in industrial fields. For example, Suh and Cha proposed a multi-type pixel-level crack detection method based on Faster R-CNN [1]; Cheng et al. introduced a road crack detection method using deep convolutional neural networks and the U-Net architecture [2]; and Qu et al. developed an improved road surface crack classification network model based on VGG16 [3]. Despite these advancements, research on utilizing machine learning for concrete crack detection remains relatively scarce.
Therefore, this paper aims to explore methods for efficiently, accurately, and automatically detecting concrete defects and cracks using deep learning.
In this study, we propose five novel shallow convolutional neural network (CNN) architectures based on spatial attention mechanisms for detecting cracks on concrete surfaces and compare them with mainstream deep learning models, such as InceptionV3 and ResNet50, in order to explore new methods suited to this task. The experimental results show that the proposed shallow CNN models perform well, which holds promise for adding value to concrete crack detection in the industrial sector by automating the process and improving both accuracy and efficiency.

Data Source
This paper utilizes an open-source dataset of cracked concrete images collected from various campus buildings at Middle East Technical University for classification purposes [4]. The dataset is divided into "negative" and "positive" categories, where the "negative" category comprises concrete images without cracks. The dataset contains 40,000 concrete surface images of 227 × 227 pixels with RGB channels.

Data Preprocessing
First, the contrast and brightness of the images in the dataset are adjusted to better highlight image features, thereby facilitating model learning. To ensure that all images have the same size and aspect ratio before being input into the model, we use the resize function in the cv2 library to rescale them. Moreover, the images are normalized so that pixel values lie between 0 and 1, which contributes to reducing computational complexity and accelerating model convergence.
Noise is inevitable in image acquisition and processing. It is therefore also essential to smooth each image with a Gaussian filter to reduce the effect of noise and enhance the model's ability to recognize image features. Furthermore, our proposed model reads images with the imread function in the cv2 library and specifies the grayscale flag (cv2.IMREAD_GRAYSCALE), converting them from three-channel RGB to single-channel grayscale. This differs from mainstream models (such as ResNet50), which use three-channel RGB input. The approach has the advantage of reducing the dimensionality of the image data, thus decreasing the amount of information the model needs to process, lowering computational complexity, and speeding up model training.
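The preprocessing steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: it uses plain NumPy instead of the cv2 calls named in the text so that it is self-contained, it omits resizing, and the gain, bias, kernel size, sigma, and luminance weights are all assumed values that the paper does not specify. The function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """Return a normalized 2-D Gaussian kernel of shape (size, size)."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    kernel = np.outer(g, g)
    return kernel / kernel.sum()

def preprocess(rgb, gain=1.2, bias=10.0, ksize=5, sigma=1.0):
    """Adjust contrast/brightness, convert to grayscale, smooth, scale to [0, 1].

    gain, bias, ksize and sigma are illustrative assumptions.
    """
    img = rgb.astype(np.float64)
    # Linear contrast (gain) and brightness (bias) adjustment, clipped to 8-bit range.
    img = np.clip(gain * img + bias, 0.0, 255.0)
    # RGB -> single-channel luminance (BT.601 weights, assumed here).
    gray = img @ np.array([0.299, 0.587, 0.114])
    # Gaussian smoothing with edge-replicated borders.
    pad = ksize // 2
    padded = np.pad(gray, pad, mode="edge")
    kernel = gaussian_kernel(ksize, sigma)
    h, w = gray.shape
    smooth = np.zeros_like(gray)
    for dy in range(ksize):
        for dx in range(ksize):
            smooth += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    # Normalize pixel values to [0, 1].
    return np.clip(smooth / 255.0, 0.0, 1.0)
```

Calling `preprocess` on a 227 × 227 × 3 array returns a 227 × 227 single-channel array, which is the reduction in input dimensionality that the paragraph above refers to.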

Model Architecture Design
Convolutional neural networks (CNNs) are a type of deep learning model that has been widely used in fields such as computer vision and natural language processing. The basic structure of a CNN includes an input layer, convolutional layers, activation functions, pooling layers, and fully connected layers. In the field of concrete surface crack detection, there have been many studies on using deep CNNs. For example, in [5], the authors proposed a vision-based method using a deep CNN for detecting concrete cracks, which, unlike traditional image processing techniques, learns image features automatically without separate feature extraction steps. In [6], the authors developed a deep learning-based method for the automated processing of concrete surface images to perform crack recognition and used shadow enhancement techniques to improve the accuracy of automatic crack detection in concrete. In [7], the authors proposed a convolutional neural network called RUC-Net for pixel-level road crack segmentation and verified its effectiveness through experiments. In [8], the authors proposed a detection model based on the deep convolutional neural network Inception-ResNet-v2 for evaluating cracks in ballastless track slabs, showing good performance, robustness, and adaptability to noise and lighting. In [9], the authors proposed a novel deep residual convolutional neural network (Parallel ResNet) that achieved excellent results and can be used as a reliable method for analyzing road surface crack images and planning the best road maintenance strategy.
However, the complexity and computational cost of deep CNNs often limit their feasibility in practical applications. In some specific scenarios, shallow CNNs may therefore be more cost-effective. To address concrete surface crack detection more efficiently, this paper proposes five customized shallow CNN architectures based on a spatial attention mechanism, among which the best-performing one is CNN-5. The shallow design keeps both training cost and computational cost low. This section provides a detailed description of the design of this network architecture and its components.

Spatial Attention Layer
To begin, we constructed a customized Spatial Attention Layer (SAL), which aims to guide the network to focus on regions that are more likely to contain cracks by learning weight assignments for different areas. Specifically, we used a convolutional layer with one filter and a kernel size of 1 × 1 in the SAL and calculated attention weights using a sigmoid activation function. We then multiplied the attention weights with the input feature map to obtain the attention-modulated feature map. With the attention mechanism in place, the model can focus on the texture features specific to concrete cracks, resulting in improved performance. Moreover, our architecture is primarily based on CNNs, and the incorporation of attention mechanisms allows the model to maintain high accuracy while significantly reducing training time and cost.
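The forward pass of such a layer can be sketched in NumPy as follows. This is an illustrative reading of the description above, not the authors' implementation: the function names, the channels-last layout, and the presence of a bias term are assumptions. A 1 × 1 convolution with a single filter reduces to a per-pixel dot product over the channel axis.

```python
import numpy as np

def sigmoid(z):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def spatial_attention(feature_map, weights, bias=0.0):
    """Apply a single-filter 1x1 convolution, sigmoid, and multiply.

    feature_map: array of shape (H, W, C), channels last (assumed layout).
    weights:     array of shape (C,), the one 1x1 filter.
    bias:        scalar bias of the filter (assumed to be present).
    """
    # 1x1 convolution with one filter: dot product over channels at each pixel.
    logits = feature_map @ weights + bias            # shape (H, W)
    # Attention weights in (0, 1), one per spatial location.
    attention = sigmoid(logits)[..., np.newaxis]     # shape (H, W, 1)
    # Broadcast the attention map over all channels.
    return feature_map * attention
```

Because every attention weight lies strictly between 0 and 1, the layer can only rescale activations at each location; it preserves the shape of the feature map, so it can be placed after any pooling layer.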

Network Architecture of CNN-5
The proposed network architecture consists of 5 convolutional layers, 5 pooling layers, and 5 Spatial Attention Layers (SALs), as well as a fully connected layer for crack classification. Convolutional layer 1 performs feature extraction on the input image with 64 filters, a kernel size of 3, "same" padding, and ReLU activation. Pooling layer 1 reduces the dimensionality of the output of convolutional layer 1 through max pooling. After the first pooling layer, we added a custom SAL to enhance the feature representation of key regions by learning weight assignments for different areas, enabling the network to focus on regions that are more likely to contain cracks. We then added convolutional layers 2, 3, and 4, each with the same configuration as convolutional layer 1, and equipped them with pooling layers 2, 3, and 4, as well as SALs 2, 3, and 4. Subsequently, convolutional layer 5 extracts features from the output of the fourth SAL with 128 filters, a kernel size of 3, "same" padding, and ReLU activation. Similarly, pooling layer 5 and SAL 5 were added after convolutional layer 5. The output of SAL 5 was then flattened for input into a fully connected layer with 256 neurons and ReLU activation. Following the fully connected layer, we added a Dropout layer that randomly drops neurons with a probability of 0.5 to improve the model's generalization ability. After the Dropout layer, we added a BatchNormalization layer to speed up training and improve model performance. Finally, we used a fully connected layer with 2 neurons and softmax activation to perform crack classification, computing the probability of each class.
With the shallow CNN-5 architecture described above, we obtain a customized model that performs best among our proposed networks in detecting cracks on concrete surfaces. While reducing computational complexity and memory consumption, the shallow CNN-5 network still demonstrates performance comparable to that of deep CNNs.
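To make the size of CNN-5 concrete, the following sketch traces feature-map sizes and parameter counts through the five stages. Several inputs are assumptions rather than facts from the paper: a 227 × 227 single-channel input, 2 × 2 max pooling with stride 2 (the pool size is not stated), and a bias term in every convolution, including the SAL's 1 × 1 filter. The helper names are hypothetical.

```python
def conv_params(kernel, in_ch, out_ch):
    """Weights plus biases of a 2-D convolution layer."""
    return kernel * kernel * in_ch * out_ch + out_ch

def trace_cnn5(size=227, in_ch=1, filters=(64, 64, 64, 64, 128), pool=2):
    """Return per-stage (spatial size, channels, params) and the flattened length.

    Each stage is conv 3x3 ("same" padding) -> max pool -> SAL.
    size, in_ch and pool are assumptions, not values given in the paper.
    """
    stages = []
    ch = in_ch
    for out_ch in filters:
        params = conv_params(3, ch, out_ch)    # "same" padding keeps the spatial size
        size = size // pool                    # max pooling without padding
        params += conv_params(1, out_ch, 1)    # SAL: 1x1 convolution with one filter
        ch = out_ch
        stages.append((size, ch, params))
    return stages, size * size * ch

stages, flat = trace_cnn5()
dense_params = flat * 256 + 256                # fully connected layer with 256 neurons
for i, (size, ch, params) in enumerate(stages, start=1):
    print(f"stage {i}: {size}x{size}x{ch}, {params} parameters")
print(f"flattened length: {flat}, dense layer parameters: {dense_params}")
```

Under these assumptions the spatial size shrinks from 227 to 7 over the five stages, and the first fully connected layer holds far more parameters than all convolutional and attention layers combined.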

Network Architectures of Other Shallow CNNs for Comparison
The other four customized shallow CNN architectures are similar to the CNN-5 described above, differing in the number of convolutional, pooling, and attention layers. The CNN-4 architecture includes four convolutional layers, each followed by a pooling layer and a spatial attention layer, as well as a fully connected layer for crack classification, and is configured similarly to CNN-5: the first three convolutional layers have 64 filters, a kernel size of 3, "same" padding, and ReLU activation, while the final convolutional layer has 128 filters. Likewise, in CNN-x (where x = 1, 2, 3), every convolutional layer except the xth has 64 filters, a kernel size of 3, "same" padding, and ReLU activation, while the xth convolutional layer has 128 filters. In CNN-x (where x = 1, 2, 3, 4), the pooling layers, spatial attention layers, and fully connected layers have the same configuration as in CNN-5.
The network architectures of the best-performing custom model, CNN-5, and of the other four custom models, CNN-x (where x = 1, 2, 3, 4), are shown in Figure 1.
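The filter configuration shared by the five variants can be written as a small rule. The helper below is a hypothetical illustration of that rule as stated in the text, not code from the paper.

```python
def cnn_x_filters(x):
    """Filter counts of the convolutional layers in CNN-x, for x in 1..5.

    Every layer except the last (the x-th) has 64 filters; the x-th has 128.
    """
    if not 1 <= x <= 5:
        raise ValueError("x must be between 1 and 5")
    return [64] * (x - 1) + [128]

for x in range(1, 6):
    print(f"CNN-{x}: {cnn_x_filters(x)}")
```

Each entry corresponds to one convolution, pooling, and spatial attention stage, so the length of the list is also the depth of the variant.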

Inception-V3 Model
Inception v3, a convolutional neural network for image analysis and object detection, is the third iteration of Google's Inception series. The Inception module, its key building block, executes parallel convolutional or pooling operations on its input, generating a feature map for improved representation. Our 48-layer, 21.8-million-parameter Inception-V3 model is pre-trained on ImageNet and applied to crack detection, using 3 × 224 × 224 image input and a 2-neuron softmax output layer for binary classification.

ResNet-50 Model
ResNet, a deep learning architecture for image recognition and classification, addresses the "degradation problem" with "shortcut connections," facilitating the training of deeper networks. ResNet-50, used in our study, has 50 layers and 29.5 million trainable parameters. This heavier model demands more computational resources for training and deployment, reducing cost-effectiveness. We apply a ResNet50 model pre-trained on ImageNet to crack detection, using 3 × 224 × 224 image input, 5 stages with 49 convolutional layers and 1 fully connected layer, and a 2-neuron softmax output layer for binary classification.

Experimental Setup
All experiments were conducted using an Intel(R) Xeon Gold 6142 @ 2.6 GHz CPU, 27.1 GB RAM, and an NVIDIA GeForce RTX 3080 GPU. The experimental environment employed Docker v20.10.10, Python v3.7, PyTorch v1.10, and TensorFlow v2.7.0. We employed sparse categorical cross-entropy as the loss function and the Adam optimizer for weight updates. Model performance metrics included accuracy, loss, precision, recall, F1-score, area under the ROC curve (AUC), and training time. Combining grid search and random search, we utilized Grid-Search with Random Sampling for parameter optimization. The dataset was split into 75% training and 25% validation sets, using a batch size of 128 and training for 20 epochs.
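As a small worked illustration of this setup, the sketch below computes sparse categorical cross-entropy for integer labels and the number of optimizer steps per epoch implied by the stated split and batch size. The function names are hypothetical, the clipping constant is an assumption, and the step count assumes the last partial batch is kept rather than dropped.

```python
import math

def sparse_categorical_crossentropy(probs, labels, eps=1e-7):
    """Mean of -log(p[true class]) over a batch.

    probs:  list of per-class probability lists (e.g. softmax outputs).
    labels: list of integer class ids (0 = negative, 1 = positive, assumed).
    eps:    lower bound on probabilities to avoid log(0) (assumed value).
    """
    losses = [-math.log(max(p[y], eps)) for p, y in zip(probs, labels)]
    return sum(losses) / len(losses)

def steps_per_epoch(n_images=40000, train_fraction=0.75, batch_size=128):
    """Return (training-set size, batches per epoch), keeping the last partial batch."""
    n_train = int(n_images * train_fraction)
    return n_train, math.ceil(n_train / batch_size)

print(sparse_categorical_crossentropy([[0.9, 0.1], [0.2, 0.8]], [0, 1]))
print(steps_per_epoch())
```

With 40,000 images, the 75% split gives 30,000 training images and 10,000 validation images, so each of the 20 epochs consists of 235 batches under the assumption above.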

Experimental Evaluation
In this study, we evaluate the effectiveness of the trained classification models using several performance metrics: accuracy, loss, precision, recall, F1-score, the area under the Receiver Operating Characteristic (ROC) curve, and training time. The equations of the evaluation metrics are shown in Table 1.
Table 1. The equations of the evaluation metrics.

Metrics Equations
Accuracy
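For reference, the standard confusion-matrix definitions of the classification metrics are given below, on the assumption that Table 1 uses these usual forms. The symbols TP, TN, FP, and FN (true positives, true negatives, false positives, and false negatives) are introduced here for illustration and do not appear in the original text.

```latex
\begin{align}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall}    &= \frac{TP}{TP + FN} \\
\text{F1-score}  &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{align}
```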

Results of Shallow CNNs
The experiments compared the results obtained from the customized shallow CNNs, ResNet-50, and Inception-V3. Among the five shallow CNNs proposed in this paper, the customized CNN-5 model demonstrated the best performance. The accuracy and loss of the CNN-5 model stabilized after the 10th epoch, ultimately reaching 99.56% and 0.014, respectively, after 20 epochs. On the remaining evaluation metrics, CNN-5 also performed well, with a precision of 0.9972, a recall of 0.9999, an F1-score of 0.9986, an AUC of 0.9991, and a training time of 206.02 seconds. Furthermore, the minimal difference between the training and validation accuracies suggests that the model does not suffer from overfitting.
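The reported F1-score can be checked against the reported precision and recall, since F1 is their harmonic mean. The snippet below is only a consistency check on the rounded figures quoted above; the function name is hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values reported for CNN-5 in the text (rounded to four decimals).
f1 = f1_score(0.9972, 0.9999)
print(f1)
```

The recomputed value differs from the reported 0.9986 by less than 0.0002, which is consistent with the precision and recall themselves having been rounded to four decimal places.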
The other shallow CNNs proposed in this paper also performed well, although slightly worse than CNN-5. Among them, CNN-4 and CNN-3 achieved accuracies above 99%, and their precision, recall, F1-score, and AUC values were all close to those of CNN-5. Their training times ranged between 160 and 180 seconds. In comparison, the performance of CNN-2 and CNN-1 was marginally lower, with accuracies stabilizing at 98% and 96%, respectively, and their training times were relatively short. Except for the precision and F1-score of CNN-1 (0.9721 and 0.9858, respectively) and the precision of CNN-2 (0.9887), all evaluation metrics of the custom models exceeded 0.9940. The other customized shallow CNN models also did not exhibit overfitting. To better compare the performance of each model, accuracy and loss plots for model training and validation are provided in Figure 2.
Regarding AUC, the values of all the proposed models are very high, ranging from 0.9973 to 0.9994, and are very close to each other, indicating no significant differences in their classification performance. The ROC curves plotted for each model show that shallow CNN models can achieve excellent performance in image classification tasks. The remaining performance metrics therefore need to be evaluated to determine the best spatial-attention-based shallow CNN model for a specific task. Table 2 shows the values of each evaluation metric for the five shallow CNN models based on the spatial attention mechanism.

Results of Mainstream Deep Learning Models
As illustrated in Figure 3 and Table 3, InceptionV3 demonstrated impressive performance in accuracy, precision, recall, F1-score, and AUC, achieving 99.70% accuracy and 0.0050 loss after 20 epochs. The model effectively distinguished cracked from non-cracked images, with no signs of overfitting. However, its complexity results in higher training costs and longer training times, which matters especially in resource-limited scenarios. Thus, balancing performance and computational cost is crucial. Our proposed shallow CNNs based on spatial attention mechanisms may offer a more suitable choice for efficient industrial production.
ResNet50 outperformed all tested models in accuracy, precision, recall, F1-score, and AUC, stabilizing at 99.94% accuracy and 0.0010 loss after 20 epochs. Similarly, as shown in Figure 3 and Table 3, it effectively distinguished cracked from non-cracked images without overfitting. Despite ResNet50's top performance, its extended training time and higher costs increase deployment expenses and limit its applications. A balance between performance and computational overhead is crucial in practice. The more complex ResNet50 and InceptionV3 models hinder efficiency and resource optimization in industrial settings. Our proposed shallow CNNs incorporating spatial attention offer a potentially better alternative.

Comparison of Model Training Costs
Although the results of the seven models mentioned above are similar, the training cost of the mainstream models is much higher than that of the proposed shallow CNNs. Specifically, ResNet50 took the longest time to train, at

Discussion
This paper proposes a set of shallow neural networks based on spatial attention mechanisms, namely CNN-1 to CNN-5, which achieve outstanding performance in the recognition of concrete crack images using a publicly available dataset. Among them, CNN-5 performs the best, with a test accuracy and loss of 99.56% and 0.014, respectively, and a training time of only 206 seconds. Compared with the two prevalent models, Inception v3 and ResNet50, the training time of CNN-5 is only about 20% and 10% of their respective durations. At the same time, its accuracy closely matches that of these mainstream counterparts.
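The 20% and 10% figures follow directly from the training times reported in the training cost comparison (206.02 s for CNN-5, 1126 s for InceptionV3, and 1977 s for ResNet50). The short calculation below reproduces them; the dictionary and function names are hypothetical.

```python
# Training times in seconds, as reported in the paper.
train_seconds = {"CNN-5": 206.02, "InceptionV3": 1126.0, "ResNet50": 1977.0}

def time_ratio(model, baseline, times=train_seconds):
    """Training time of `model` as a fraction of that of `baseline`."""
    return times[model] / times[baseline]

for baseline in ("InceptionV3", "ResNet50"):
    ratio = time_ratio("CNN-5", baseline)
    print(f"CNN-5 / {baseline}: {ratio:.1%}")
```

The ratios come out at roughly 18% for InceptionV3 and roughly 10% for ResNet50, which the text rounds to 20% and 10%.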
The excellent performance of the five shallow convolutional neural networks used in this paper can be mainly attributed to two reasons: 1) the combination of shallow neural networks with spatial attention layers enables better extraction of important features related to crack parts in the images, which is suitable for concrete crack detection tasks; 2) the crack features in the dataset are not very complex, and shallow networks are sufficient to handle the recognition task.
Although the proposed model exhibits performance nearly on par with the two mainstream models, Inception v3 and ResNet50, it offers considerable advantages in terms of reduced training time and cost. The main limitation of the mainstream models in such tasks is their relatively long training time, which results in high industrial deployment costs.
Currently, research on using machine learning to recognize surface cracks focuses on the application and improvement of mainstream deep neural network models, including AlexNet [10], VGG16 [3], and YOLO V3 [11]. These models all have complex structures and long training times, resulting in high deployment costs in practical applications. In contrast, the shallow convolutional neural networks proposed in this paper are lightweight, train quickly, and achieve high recognition accuracy, giving them greater potential for industrial production and development.
For the publicly available dataset used in this paper, shallow convolutional neural networks based on spatial attention mechanisms have achieved results almost comparable to those of deep neural network architectures. Nonetheless, the generalizability of the proposed model requires further extensive validation on a broader range of datasets in the future. This would ensure that the model is truly applicable to practical industrial scenarios and can be used effectively.

Conclusion
This study first constructs an image dataset of concrete surface cracks and preprocesses it. Subsequently, we propose a shallow CNN model based on the spatial attention mechanism for concrete surface crack detection, set the hyperparameters of the network model, introduce the spatial attention mechanism, and input the dataset for model training. Finally, we analyze and evaluate the results, comparing them with mainstream deep learning models such as InceptionV3 and ResNet50.
The shallow neural networks based on the spatial attention mechanism proposed in this study, CNN-1 to CNN-5, exhibit outstanding performance in the task of concrete crack detection. Among them, CNN-5 performs the best, with a test set accuracy of 99.56%, a loss of 0.014, and a training time of 206 seconds. Compared with the mainstream InceptionV3 and ResNet50 models, CNN-5 reduces training time to about 20% and 10% of theirs, respectively, while maintaining a very similar level of accuracy.
Consequently, the proposed shallow convolutional neural network is lightweight, trains quickly, and achieves high recognition accuracy, showing great potential for practical industrial development. We anticipate its application in future industrial production processes and intelligent construction quality inspection, enabling more efficient, automated, and accurate identification of concrete surface cracks and thereby reducing the risk of catastrophic failures.

Figure 2. The accuracy and loss of the proposed shallow CNN models on the validation sets.

Figure 3. The accuracy and loss of mainstream deep learning models on validation sets.
1977 s, followed by InceptionV3 at 1126 s. In contrast, the training cost of the five proposed shallow CNN models varied only slightly, ranging from 145 s for CNN-1 to 206 s for CNN-5. The results demonstrate that the top-performing shallow CNN model based on the spatial attention mechanism, CNN-5, achieves an accuracy nearly equivalent to that of ResNet50, indicating comparable performance. Remarkably, the training time of CNN-5 is only about 10% of that required by ResNet50 (206 s versus 1977 s) and about 20% of that required by InceptionV3 (206 s versus 1126 s), highlighting its efficiency.
ICAITA-2023, Journal of Physics: Conference Series 2637 (2023) 012020, IOP Publishing, doi:10.1088/1742-6596/2637/1/012020

Table 2. The values of the evaluation metrics for the proposed models.

Table 3. The values of the evaluation metrics for the proposed models.