Wildfire Smoke Detection Based on Swin Transformer

Wildfires have devastating consequences for ecosystems and human lives, so detecting them accurately is crucial for reducing losses. Most researchers detect wildfire smoke with traditional smoke detection algorithms, deep learning algorithms, or a combination of both, but these approaches still suffer from a high false positive rate. To enhance the accuracy of wildfire smoke detection, we present an algorithm based on the Swin Transformer architecture: by introducing the Swin Transformer Block (STB) into GoogLeNet, it improves the long-range dependence of high-level feature maps in the convolutional neural network and thus reduces the false positive rate of smoke detection. We treat wildfire smoke detection as a classification problem and divide a dataset of 18,000 real wildfire smoke images into two distinct sets: a training set and a testing set. The experimental results support the efficacy and practicality of the proposed algorithm, which achieves a smoke detection accuracy of 95% while reducing the false alarm rate to as low as 4%. These findings underscore the algorithm's capability to accurately identify wildfire smoke while minimizing erroneous detections.


Introduction
Among the multitude of natural disasters, wildfires stand out as a prominent and pervasive menace, posing significant threats to both public safety and the ecological environment. Owing to the sudden and devastating nature of wildfires, there is a pressing real-world need for efficient fire detection and early warning systems. Detecting smoke in the early stages of a wildfire therefore holds immense theoretical and practical significance.
Traditional wildfire smoke detection algorithms often rely on sensors or manually selected smoke characteristics. Among sensor-based algorithms, reference [1] suggested applying photoacoustic aerosol sensors to fire detection, but this approach is prone to missed detections or false alarms due to sensor aging or malfunction. Among algorithms based on manually selected smoke features, reference [2] introduced a smoke block detection algorithm that incorporates color features, texture features, and a single artificial neural network (ANN), ensuring dependable, swift, and continuous detection in diverse scenarios. However, the complexity of detection targets and the subjective nature of manual feature selection often lead to suboptimal models, impacting the overall accuracy of the smoke detection system. With the rapid progress and widespread adoption of computer vision and deep learning techniques, several smoke image detection algorithms based on deep learning have emerged [3][4][5][6]. They primarily classify individual smoke images using convolutional neural networks (CNNs) such as AlexNet, ResNet, VGGNet, and MobileNet, but all of them still suffer from missed or false detections.
To tackle the challenges outlined above, we propose a wildfire smoke detection algorithm that leverages the Swin Transformer architecture. We integrate the Swin Transformer Block (STB) into the GoogLeNet model, enabling the fusion of global and local features. By enriching the semantic information and spatial details within the high-level features of the convolutional neural network, our algorithm enhances the accuracy of smoke detection while minimizing false alarms and missed detections. This integration yields a comprehensive and robust detection system for wildfire smoke.

Related work
In this section, we primarily present a comprehensive overview of smoke image detection algorithms that leverage deep learning and Transformer techniques.

Smoke image detection approach that harnesses deep learning
Zhang et al. [3] introduced a smoke image classification method called DarkC-DCN, which leverages dark channel priors and a dual convolutional network. The primary framework of the method utilizes a modified version of the AlexNet residual network, achieving a classification accuracy of 98.56%. You et al. [4] introduced a dual attention network with a three-step bilinear pool to classify fire smoke instances, with an accuracy of 90.11%. Gong et al. [5] introduced an attention mechanism that fuses feature-based and decision-based modules for smoke detection; this approach incorporates spatial attention and channel attention within the VGG network architecture, and experimental findings demonstrate its superior accuracy in detecting small smoke targets. Palade et al. [6] combined the efficiency of edge computing with the accuracy of CNNs to enable efficient and accurate smoke detection, using a lightweight network architecture with good detection performance.

Transformer
The Transformer [7] was designed for sequence modeling and transformation tasks and has achieved great success in the language field because it focuses on long-term dependency modeling in data. More recently, there have been notable advancements in applying Transformer models to computer vision, where several transformer network architectures have gained popularity, including the Vision Transformer (ViT) [7] and the Swin Transformer [8]. Of these, the Swin Transformer requires less computation, and its network architecture is closer to that of a convolutional neural network.
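The long-term dependency modeling discussed above rests on scaled dot-product self-attention, which both ViT and the Swin Transformer build on. A minimal NumPy sketch with toy dimensions (not the paper's implementation) illustrates the core operation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V — each output row is a weighted
    sum of all value rows, so every position can attend to every other."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # (n, n) pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # weighted sum of values

# Toy example: 4 tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
out = scaled_dot_product_attention(x, x, x)       # self-attention: Q = K = V = x
print(out.shape)  # (4, 8)
```

Because the attention matrix is computed over all token pairs, global attention costs quadratic time in the number of tokens; the Swin Transformer's window-based attention restricts this computation to local windows, which is why it is cheaper.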

Wildfire Smoke Detection Based on Swin Transformer
The primary objective of the proposed approach is to accurately identify smoke areas in input images and classify them correctly. Since smoke is a non-rigid object with an irregular shape, we use GoogLeNet [9] as the basic network architecture. At the same time, the Swin Transformer Block (STB) is introduced into GoogLeNet, which improves the precision of the classification outcomes. The complete network architecture is depicted in Figure 1.

Figure 1. The increased Swin Transformer Block and the original Inception module of the network.

GoogLeNet Review
The GoogLeNet [9] model was the champion of the ILSVRC 2014 image classification task. It consists of 22 layers, yet despite its increased depth its parameter count is only 1/12 that of AlexNet [3], and its detection performance is better. The key breakthrough of GoogLeNet lies in the Inception block, which places filters of various sizes in parallel on the same layer of the network, enabling the fusion of features at different scales within a single layer and giving the detected object scale invariance. In addition, the Inception module adds a 1×1 convolution kernel before the 3×3 and 5×5 convolution kernels to decrease the channel count, and a 1×1 filter after the 3×3 pooling layer to reduce dimensionality, which effectively alleviates the pressure on computing resources. GoogLeNet also has two auxiliary classifiers, which retain strong recognition ability at relatively shallow depths, so the output of an intermediate layer can also be leveraged for classification.
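The parameter saving from the 1×1 bottleneck is easy to verify with simple arithmetic. The channel counts below are illustrative, not taken from the paper's configuration:

```python
def conv_params(c_in, c_out, k):
    """Number of weights in a k×k convolution (biases omitted)."""
    return c_in * c_out * k * k

# Hypothetical channel counts for one Inception branch.
c_in, c_mid, c_out = 192, 64, 128

# Direct 3×3 convolution: 192 -> 128 channels.
direct = conv_params(c_in, c_out, 3)
# Bottleneck: 1×1 reduces 192 -> 64, then 3×3 expands 64 -> 128.
bottleneck = conv_params(c_in, c_mid, 1) + conv_params(c_mid, c_out, 3)

print(direct, bottleneck)  # 221184 86016
```

Under these assumed channel counts the bottleneck cuts the branch's weights by roughly 60%, which is the mechanism behind GoogLeNet's small parameter budget relative to AlexNet.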

The proposed method
According to the actual dataset, we use a shallower network for classification, reducing the original 22 layers to 12 layers, and achieve better results. The original GoogLeNet network primarily focuses on local features during detection, overlooking the significance of global features within an image; this drawback ultimately contributes to an increased occurrence of false positive detections in real-world scenarios. To address this limitation, we introduce four STB blocks into GoogLeNet to process and aggregate long-range dependencies of the high-level features in the original network. The detailed structure is depicted in Figure 2. The STB block can model relationships not only between adjacent locations but also between distant ones. The essence of an STB is two consecutive Swin Transformer Blocks [8]. The Swin Transformer replaces the multi-head self-attention (MHSA) of the Transformer with window-based MHSA, greatly reducing network computation. However, if the windows are fixed, the network only attends within each window and ignores the connections to adjacent windows, so we adopt the shifted-window attention mechanism, which increases the connections between windows. As in the Swin Transformer Block, SW-MSA is performed after W-MSA, which establishes communication between windows and improves the global characteristics. Therefore, referring to Swin-T [8], we add the STB module to GoogLeNet, which increases long-range dependence: the output feature map $z^{l-1}$ of GoogLeNet is sent to the STB module, and the output feature map of the STB module can be expressed as follows:

$\hat{z}^{l} = \text{W-MSA}(\text{LN}(z^{l-1})) + z^{l-1}$ (1)

$z^{l} = \text{MLP}(\text{LN}(\hat{z}^{l})) + \hat{z}^{l}$ (2)

$\hat{z}^{l+1} = \text{SW-MSA}(\text{LN}(z^{l})) + z^{l}$ (3)

$z^{l+1} = \text{MLP}(\text{LN}(\hat{z}^{l+1})) + \hat{z}^{l+1}$ (4)

where LN denotes layer normalization, MLP is two fully connected layers with GELU as in the MLP block of the Transformer, W-MSA refers to window-based MHSA, and SW-MSA denotes shifted-window-based MHSA. SW-MSA is computed as window-based MHSA with a mask, obtained by cyclically shifting the feature map so that the shifted windows map back onto the original window layout. To identify the positions within different windows, relative position encoding [8] is added to the window-based MHSA. In summary, GoogLeNet with an STB module enables the model to capture the spatial relationships between different features, taking into account both their content and relative distance. As a result, the network can effectively focus on relevant areas by incorporating global information, facilitating convergence.

Experiments
The objective of the proposed algorithm is to accurately classify images and identify the presence and location of smoke. We use annotated datasets to train the classifier and obtain classification results, and we determine the location of smoke with Grad-CAM [10], which is used for visual analysis. The dataset used in this paper comprises a total of 18,000 annotated RGB images, each with a resolution of 1920×1080 pixels. Within this dataset, 9,000 images contain smoke, while the other 9,000 images depict fog scenarios. To ensure comprehensive evaluation, the dataset is divided as follows: 70% for training, 10% for validation, and 20% for testing. The experimental environment is Ubuntu, with an AMD Ryzen 9 5900X 12-core processor and an NVIDIA GeForce RTX 3090 for processing and computations. We initialize from weights pre-trained on ImageNet. The learning rate was initially set to 0.01, with a momentum value of 0.9.
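For reference, the stated percentages work out to the following image counts (pure integer arithmetic, no assumptions beyond the figures given above):

```python
total = 18_000              # 9,000 smoke + 9,000 fog images
train = total * 70 // 100   # 70% for training
val   = total * 10 // 100   # 10% for validation
test  = total - train - val # remaining 20% for testing

print(train, val, test)  # 12600 1800 3600
```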
To better evaluate the proposed algorithm, we compared it with state-of-the-art image classification methods, including AlexNet [3], ResNet [4], VGGNet [5], and MobileNet [6]. To ensure fairness, all experiments were carried out under the same experimental conditions, and all algorithms were fine-tuned from their original pre-trained models. Table 1 shows the classification results of these different methods. The smoke detection results in Figure 3 make it evident that traditional convolutional neural networks tend to focus solely on local features during feature extraction, neglecting the contribution of global detection targets to network classification. As a consequence, misjudgments occur, and such a network identifies only a small portion of the smoke area while failing to locate all smoke regions. In contrast, the proposed algorithm exhibits superior performance in recognizing small smoke targets. Particularly in the first and last images of Figure 3, AlexNet, ResNet, VGGNet, and MobileNet demonstrate noticeable errors in smoke recognition, whereas the proposed algorithm accurately identifies the smoke area. In conclusion, the Swin Transformer-based algorithm introduced in this study enables precise classification of wildfire smoke and accurate localization of the areas where smoke occurs.

Conclusion
We propose an algorithm for the detection of wildfire smoke that combines the local features of a traditional neural network with the global features of the Swin Transformer. The algorithm adds an STB module to the original GoogLeNet network, which improves the spatial information of the high-level output features within the convolutional neural network, thus improving the accuracy of smoke classification. We conducted a comparative analysis between our approach and four other advanced techniques. The experimental results indicate that the best model achieved an accuracy of 95%, while the false positive rate was reduced to as low as 4%. These findings affirm the outstanding performance of the proposed method in accurately detecting smoke while minimizing false alarms.

Figure 3. Smoke detection results by different methods.

Table 1. Experimental comparison results.

As shown in Table 1, the evaluation results of our algorithm on the test dataset indicate that its performance is superior to these benchmark approaches. The specific experimental results are depicted in Figure 3.