COVID-19 Intelligent Detection System Based on Cloud-Edge Fusion Recognition Model

Analyzing cough sounds can help with the rapid detection of COVID-19. This paper proposes a cloud-edge deep learning fusion-based intelligent detection system for COVID-19. On the cloud-side, a COVID-19 detection model based on ResNet18 is employed, with log-Mel-spectrum features used as inputs. On the edge-side, a COVID-19 detection model based on TCNN is developed using raw audio inputs. To improve detection accuracy, the recognition results of both models are fused on the cloud-side. On the test dataset, the fusion model attained a sensitivity of 0.8012, an AUC of 0.8251, and a specificity of 0.7255. Comparative testing shows that the fusion model outperforms the other models in classification performance and is less prone to false-negative errors. It provides a novel approach to COVID-19 recognition and performs well as an auxiliary detection method.


Introduction
Since the COVID-19 outbreak in 2019, nations and regions throughout the world have experienced the spread of the disease [1]. The two main techniques for COVID-19 detection are nucleic acid amplification tests (NAAT) and antigen tests. Although nucleic acid testing is difficult, time-consuming, and requires specialized labs and equipment, it provides a high degree of accuracy and is effective in identifying early-stage infections. However, it is also vulnerable to sample quality and storage conditions. Antigen testing, on the other hand, is quick, simple, and affordable, but it has lower sensitivity and can be affected by the sources and techniques used to acquire the samples [2]. Beyond traditional diagnostic methods, artificial intelligence plays a growing role in medical diagnosis: AI-driven diagnostic systems can process high-dimensional, nonlinear, and dynamic medical information to improve diagnostic accuracy and classification performance [3]. MIT researchers have examined the effects of COVID-19 on patients' acoustic features from four angles: muscular exhaustion, vocal cord modifications, cognitive and emotional changes, and structural alterations in the lungs and respiratory system [4]. They established that COVID-19 infection and non-infection differ in audio aspects, offering a theoretical foundation for cough-sound-based COVID-19 detection. Mohammed et al. [5] presented an AI-based cough sound identification approach after discovering a high degree of consistency in the Mel-frequency cepstral coefficients (MFCC) of cough sounds from COVID-19-infected patients. Siddharth et al. [6] employed the ResNet50 network to extract and recognize characteristics from cough sounds and obtained an AUC of 0.76 on the test set. Comparing COVID-19 detection methods based on cough sounds and respiratory sounds, Jing Han et al. [7] found that cough-sound-based COVID-19 identification performed better. Ali et al. [8] created a classifier for the DiCOVA 2021 challenge that used lung X-ray images and audio information from coughing. They employed support vector machines (SVM) for classification and principal component analysis (PCA) to reduce the dimensionality of MFCC features, attaining an accuracy of almost 90%. Zhang et al. [9] found significant variations in the low-frequency energy ratio between the cough sounds of infected and non-infected persons. They further computed Time-Frequency Differential Features (TFDF) from cough sounds and coupled them with log-Mel-spectrum features in a parallel model, yielding an AUC of 0.74 on the Virufy dataset. However, most existing algorithms were trained on only a small number of datasets, limiting their ability to generalize to diverse data. The massive size of these models also makes them difficult to deploy on the edge-side. To address these concerns, this work gathered numerous datasets, including Coswara [10], Track-1 [11], CoughVid [12], NeurIPs2021 [13], and CCS [14], and carried out a thorough examination of them. The Track-1, CoughVid, NeurIPs2021, and CCS datasets were used for model pre-training; the Coswara dataset was then used for model fine-tuning. This work proposes an intelligent detection strategy for COVID-19 based on a cloud-edge fusion recognition model to increase recognition performance. The system comprises two distinct recognition models deployed, respectively, on the edge-side and the cloud-side. The fusion strategy significantly improves the accuracy of the model. The paper is organized as follows: Section 2 introduces the COVID-19 intelligent detection system, which is based on cloud-edge deep learning fusion; Section 3 discusses recognition performance, including data preprocessing, training hyperparameters, model performance analysis, and comparisons; and Section 4 offers a summary of this paper.

System Architecture
The COVID-19 intelligent detection system is intended to implement cough-based intelligent COVID-19 detection on smartphones. It is made up of modules for audio recording, edge-side recognition, data transfer, cloud-side recognition, and result fusion. The system's workflow diagram is presented in Figure 1.

Figure 1. Flow chart of inspection system
Edge-side recognition is performed once the user has recorded a cough lasting more than two seconds. The user can choose whether or not to upload the audio. If uploading is chosen, the recognition result and the original audio file are both sent to the cloud-side. Upon receipt, the cloud-side extracts the cough's log-Mel-spectrum features and performs recognition. The cloud-side recognition result is fused with the edge-side result to obtain the final detection result, which is then returned to the edge-side. On the edge-side, a lightweight TCNN recognition model with fewer parameters is applied; this addresses the resource-consumption issue that arises when many users access the system concurrently and cloud-side processing capacity becomes insufficient. The cloud-side, on the other hand, adopts a ResNet18 model to improve detection accuracy. Figure 2 shows the edge-side functional interface; the application program is compatible with the Android operating system, and a Xiaomi 10 Lite Zoom running Android 12 was employed in this experiment. TCNN can capture long-term dependencies in input data without explicitly designed gate units, making training and optimization simpler. By utilizing convolutional layers, TCNN can effectively extract local features from the input data. Additionally, the parameter-sharing mechanism in TCNN reduces the network's parameter count, enhancing the model's generalization capability [14]. This research therefore presents a TCNN-based model for edge-side recognition.
The input to the edge-side recognition model is raw audio, and its output is the recognition result; the specific architecture of the model is illustrated in Figure 3.

COVID-19 cloud-side identification model based on ResNet18
On the cloud-side, a ResNet18-based recognition model is proposed. The model employs the residual structure as its fundamental module, enabling it to more effectively address the issue of network degradation. Moreover, it has a sufficient number of parameters to fit the data well, which benefits the model's accuracy. In this paper, the PReLU activation function is chosen for its superior performance. Compared to the ReLU activation function, it carries a lower risk of overfitting [15], and the coefficient of the negative part is learned adaptively, which tends to retain more information in earlier layers and be more discriminative in deeper layers, thereby enhancing recognition capability. Because different initialization procedures often yield different outcomes on different datasets, this study alters the model's parameter initialization strategy: the convolutional layers are filled with normally distributed random values generated by the He initialization method, while the fully connected layer's weights are initialized with a uniform distribution and its bias is set to 0.1. In this model, the cough audio is rendered as a log-Mel-spectrum input, converting the audio recognition task into an image recognition problem, and the likelihood of the audio being negative or positive is determined from this input.
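The initialization scheme and PReLU activation described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; the layer sizes and the uniform-distribution limit are hypothetical, since the paper does not state them.

```python
import numpy as np

rng = np.random.default_rng(0)

def he_normal(fan_in, fan_out):
    """He initialization: normal with std = sqrt(2 / fan_in),
    suited to ReLU-family activations such as PReLU."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def fc_init(fan_in, fan_out, limit=0.05):
    """Fully connected layer: uniformly distributed weights, bias fixed
    at 0.1 (the limit value here is an assumption)."""
    w = rng.uniform(-limit, limit, size=(fan_in, fan_out))
    b = np.full(fan_out, 0.1)
    return w, b

def prelu(x, alpha):
    """PReLU: identity for x > 0, learned slope alpha for x <= 0."""
    return np.where(x > 0, x, alpha * x)
```

In a deep-learning framework the same idea corresponds to drawing convolutional weights from N(0, 2/fan_in) and treating each PReLU slope as a trainable parameter.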

Cloud-Edge Recognition Fusion Model
The fusion model consists of two components. The first is the TCNN model, which takes preprocessed cough sounds as input. Parallel to it is the ResNet18 model, which accepts log-Mel-spectrograms as input. Both components return positive/negative recognition probability vectors. These probability vectors are then fed into a fusion classifier with three fully connected layers to produce the final recognition result. To train the fusion model, the pre-training parameters are loaded into the TCNN and ResNet18 models, and the resulting parameters of the fusion classifier are saved. During testing, the recognition results from the edge-side TCNN model and the cloud-side ResNet18 model are fed directly into the cloud-side fixed-parameter fusion classifier to produce the final result. The learned fully-connected-layer classifier is more objective, lowering the influence of subjective experience and eliminating the need for manual weighting, thereby enhancing recognition performance.
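The fusion classifier's forward pass can be sketched as below: the two 2-dimensional probability vectors are concatenated and passed through three fully connected layers. This is a NumPy sketch with hypothetical hidden sizes (4 → 16 → 16 → 2) and random weights; the paper does not specify these details.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fusion_forward(p_tcnn, p_resnet, params):
    """Forward pass of the fusion classifier: concatenate the two
    probability vectors, apply three fully connected layers, and return
    the final positive/negative probabilities."""
    h = np.concatenate([p_tcnn, p_resnet])   # shape (4,)
    for i, (w, b) in enumerate(params):
        h = h @ w + b
        if i < len(params) - 1:              # hidden layers use ReLU
            h = relu(h)
    return softmax(h)

# Hypothetical layer sizes 4 -> 16 -> 16 -> 2 with random weights.
rng = np.random.default_rng(1)
params = [(rng.normal(0, 0.1, (4, 16)),  np.zeros(16)),
          (rng.normal(0, 0.1, (16, 16)), np.zeros(16)),
          (rng.normal(0, 0.1, (16, 2)),  np.zeros(2))]

out = fusion_forward(np.array([0.3, 0.7]), np.array([0.2, 0.8]), params)
```

In practice the three layers would be trained on the two frozen base models' outputs, which is what lets the classifier learn the relative weighting instead of it being set by hand.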

Pre-processing
This work employs various COVID-19 datasets as training data, and the Coswara dataset is used for comparison with prior studies. Due to the limited size and class imbalance of the Coswara dataset, data augmentation and pre-training procedures are utilized to prevent overfitting and related difficulties. The Track-1, CoughVid, NeurIPs2021, and CCS datasets were utilized for model pre-training, and the Coswara dataset for subsequent training. Because the pre-training datasets vary in quantity and quality, cough sound detection is performed on their poor-quality audio. To keep positive and negative data balanced in the pre-training set, only the positive cough audio of CoughVid was used, while both negative and positive samples of the other datasets were used. In the subsequent training set, the Coswara dataset is augmented using approaches such as adding Gaussian noise and loudness normalization. The pre-training and subsequent training datasets are each split into training and testing sets at a ratio of 9:1, and model performance is evaluated using 5-fold cross-validation. The original file status of the datasets is presented in Table 1. According to [9], cough sound information exists primarily in the low-frequency region; the audio is therefore resampled to 16 kHz, and the WebRTC VAD approach [16] is employed for cough sound recognition and extraction. The extracted cough sounds are subjected to pre-emphasis, framing, and windowing to produce audio files with a duration of 2 seconds and 50 percent overlap, which serve as inputs for the edge-side recognition model. After comparing COVID-19 recognition performance using MFCC, Chirplet, TFDF, and log-Mel-spectrum features, the log-Mel-spectrum produced from the segmented audio is chosen as the input for the cloud-side model. The log-Mel-spectrum feature is computed with a window length of 1024 samples, a hop length of 512 samples, and 128 mel frequency bands.
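The segmentation and Gaussian-noise augmentation steps above can be sketched as follows, using the paper's stated parameters (16 kHz, 2 s segments, 50% overlap). This is a minimal NumPy illustration, not the authors' pipeline; the 20 dB SNR for the added noise is an assumption, since the paper does not state the noise level.

```python
import numpy as np

SR = 16_000          # target sample rate after resampling
SEG_SECONDS = 2.0    # segment duration
OVERLAP = 0.5        # 50 percent overlap between consecutive segments

def segment_audio(y, sr=SR, seg_seconds=SEG_SECONDS, overlap=OVERLAP):
    """Split a cough recording into fixed-length overlapping segments
    (2 s windows with 50% overlap, as in the paper)."""
    seg_len = int(seg_seconds * sr)
    hop = int(seg_len * (1.0 - overlap))
    segments = [y[i:i + seg_len]
                for i in range(0, len(y) - seg_len + 1, hop)]
    return np.stack(segments) if segments else np.empty((0, seg_len))

def augment_gaussian(y, snr_db=20.0, rng=np.random.default_rng(0)):
    """Augmentation: add Gaussian noise at a given SNR
    (the 20 dB default is an assumption)."""
    power = np.mean(y ** 2)
    noise_power = power / (10 ** (snr_db / 10))
    return y + rng.normal(0, np.sqrt(noise_power), size=y.shape)

# Each 2 s segment would then yield a 128-band log-Mel-spectrum, e.g. via
# librosa.feature.melspectrogram(y=seg, sr=16000, n_fft=1024,
#                                hop_length=512, n_mels=128).
```

With these parameters, a 5-second recording yields four overlapping 32000-sample segments.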

Hyperparameters during model training
Through repeated experimentation, the optimal hyperparameter combination determined in this study is presented in Table 2.

Performance analysis of edge-side and cloud-side model
In this work, the five-fold cross-validation method is utilized, and the AUC and confusion matrix [3] are used as performance evaluation metrics. Given that misclassifying positive patients as negative is a grave error, this paper places greater emphasis on the model's sensitivity.
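The evaluation metrics used here can be computed as sketched below: AUC via the pairwise-ranking (Mann-Whitney) formulation, and sensitivity/specificity from confusion-matrix counts. This is an illustrative implementation, not the paper's evaluation code.

```python
import numpy as np

def auc_score(labels, scores):
    """AUC as the probability that a randomly chosen positive sample is
    scored above a randomly chosen negative one (ties count 0.5)."""
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    pos, neg = scores[labels == 1], scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

def sens_spec(labels, preds):
    """Sensitivity = TP/(TP+FN); specificity = TN/(TN+FP)."""
    labels = np.asarray(labels)
    preds = np.asarray(preds)
    tp = np.sum((labels == 1) & (preds == 1))
    fn = np.sum((labels == 1) & (preds == 0))
    tn = np.sum((labels == 0) & (preds == 0))
    fp = np.sum((labels == 0) & (preds == 1))
    return tp / (tp + fn), tn / (tn + fp)
```

The emphasis on sensitivity in this paper corresponds to weighting the TP/(TP+FN) term: a false negative (a missed positive patient) is the costlier error.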

Performance analysis of fusion model
The performance of the fusion model on the test set is shown in Figure 6. The fusion model achieved an AUC of 0.8417 and a sensitivity of 0.8081. This is attributed to its integration of both time-domain and frequency-domain features of the samples and to model fusion through a fully connected classifier. By combining and mutually verifying the recognition results of the two models, the fusion model achieves more accurate classification.
Additionally, it demonstrates stronger ability in accurately identifying positive samples.
Figure 6. Performance of the fusion model: (a) ROC; (b) confusion matrix

Table 3 presents a comparison between the model proposed in this work and models from prior studies that also employed the Coswara dataset for testing.
Table 3. Model performance comparison

Model                AUC     Spe     Sen
DICOVA (baseline)    0.7490  -       -
Chen [17]            0.7636  -       -
Zhang [9]            0.8163  -       -
Fusion (this work)   0.8417  0.7347  0.8081

The fusion model showed clearly greater discriminability than the separate edge-side and cloud-side models. The fusion strategy complemented the strengths of the edge-side and cloud-side models while letting them verify each other, ultimately increasing the accuracy of the model: it combines the time-frequency features recognized on the cloud-side with the temporal features recognized on the edge-side in the fully connected layer. The fusion model also showed an excellent capacity to recognize the distinctive characteristics of positive samples, leading to improved sensitivity, better discrimination of positive samples, and a reduced impact of misclassified positives.

Conclusion
This paper presents an intelligent COVID-19 detection system based on cloud-edge deep learning fusion. The system collects cough sounds using an Android application and automatically provides detection results. Owing to the disparity in computational capacity between the edge-side and the cloud-side, the system deploys recognition models based on distinct input features: the edge-side uses the TCNN model with raw audio input, while the cloud-side employs the ResNet18 model with log-Mel-spectrum features. A fusion model built from fully connected layers, with mutual verification between the two base models, improves recognition accuracy. Experiments compare the performance of the edge-side, cloud-side, and fusion models. Compared with existing models of the same type on the Coswara dataset, the fusion model achieves a higher AUC and sensitivity, especially for identifying positive patients, demonstrating its superiority. The intelligent detection system offers a novel method for COVID-19 detection and is an efficient auxiliary tool for COVID-19 testing. In addition, users have the option to perform recognition exclusively on the edge-side, ensuring the security and privacy of sensitive data.

Figure 2. Edge-side function interface display

Figure 3 shows the edge-side model's architecture in detail. The model's first convolutional layer is a causal convolutional layer, which ensures causality and prevents future-information leakage while capturing temporal relationships in time-series data. The second through tenth convolutional layers are dilated convolutional layers, which enlarge the receptive field, allowing the network to perceive a broader range of contextual information and improving model performance. Additionally, the model employs Focal Loss as the loss function to account for class imbalance, and Dropout is applied to the final two fully connected layers to prevent overfitting.
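The two key ingredients here, causal dilated convolution and Focal Loss, can be sketched in NumPy as follows. This is an illustrative sketch, not the TCNN implementation; the kernel values and the gamma default are arbitrary examples.

```python
import numpy as np

def causal_dilated_conv1d(x, kernel, dilation=1):
    """1-D causal convolution with dilation: the output at time t depends
    only on x[t], x[t-d], x[t-2d], ..., so no future information leaks in.
    Left zero-padding keeps the output the same length as the input."""
    k = len(kernel)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), np.asarray(x, dtype=float)])
    return np.array([
        sum(kernel[j] * xp[pad + t - j * dilation] for j in range(k))
        for t in range(len(x))
    ])

def focal_loss(p, y, gamma=2.0):
    """Binary focal loss for predicted positive probability p: the
    (1 - pt)^gamma factor down-weights easy examples so training focuses
    on hard, minority-class samples."""
    pt = p if y == 1 else 1.0 - p
    return -((1.0 - pt) ** gamma) * np.log(pt)
```

Stacking such layers with dilations 1, 2, 4, ... is what lets a TCNN's receptive field grow exponentially with depth while each output remains causal.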

Figure 4 depicts the ROC curves of the TCNN model and the ResNet18 model on the test set. The AUC is the area under the ROC curve and reflects the model's discriminative ability.

Figure 4. ROC curves for different networks

Due to its limited number of parameters, the average AUC achieved by the TCNN model is only 0.7196. With more parameters, ResNet18 reaches a higher AUC of 0.7730. This indicates that the probability density curves of positive and negative samples in the TCNN model's classification results have a large overlap area; the samples therefore cannot be cleanly separated into two categories, and the recognition accuracy is low. The ResNet18 model improves the recognition performance. The confusion matrices generated by the TCNN and ResNet18 models on the test set are shown in Figure 5. The TCNN model's sensitivity is 0.6592, while the ResNet18 model's sensitivity is 0.7672. The gap between the two sensitivities shows that feeding one-dimensional audio data into TCNN reduces the computational load, but its ability to learn sample characteristics is weaker. Although the ResNet18 model with log-Mel-spectrum features as input has more parameters, its deeper structure and the PReLU activation function enhance its ability to capture the distinctive features of positive samples, making positives easier to identify.

Figure 5. Confusion matrices for different networks: (a) TCNN; (b) ResNet18

The proposed fusion model achieved an AUC of 0.8417, specificity of 0.7347, and sensitivity of 0.8081. Compared with the TCNN model (AUC 0.7196, specificity 0.6763, sensitivity 0.6592) and the ResNet18 model (AUC 0.7730, specificity 0.6691, sensitivity 0.7672), the fusion model significantly improved all three metrics. As shown in Table 3, the fusion model outperformed the individual edge-side and cloud-side models in recognition performance and generalization ability. The fusion model also achieved a higher AUC than prior studies, suggesting more accurate classification. It has better sensitivity than the method of Zhang et al. [9], who utilised parallel ResNet models on TFDF and log-Mel-spectrum features, making it less likely to mistake positive patients for negative ones. The three parallel BiLSTM models used by Chen [17] simply average the three characteristics of cough, respiration, and speech in the model-fusion stage; although this technique exploits multiple characteristics jointly, it neglects the differing importance of each characteristic in determining the recognition outcome, which is a significant limitation. The fusion model in this study showed lower specificity than some other studies, indicating that its capacity to identify negative (healthy) persons needs further improvement.

ICAITA-2023, Journal of Physics: Conference Series 2637 (2023) 012026, IOP Publishing, doi:10.1088/1742-6596/2637/1/012026