Sleep Apnea Detection Based on Snoring Sound Analysis Using a DS-MS Neural Network

Obstructive sleep apnea hypopnea syndrome (OSAHS) is a high-incidence disease that causes serious harm and poses hidden dangers. Currently, traditional schemes for monitoring sleep quality mainly focus on two physiological signals: the electroencephalogram (EEG) and the heartbeat. However, during sleep, respiration is also an important physiological signal. This paper proposes a sleep apnea detection method based on snoring sound analysis using deep learning. First, snoring sound signals are preprocessed and features are extracted using Mel-frequency cepstral coefficients (MFCC). The extracted features are then used to train a DS-MS neural network model, and the optimal detection model is obtained through iterative training. Experimental results show that the accuracy of the proposed detection model reaches 94.17%.

screening model based on a three-layer BP neural network using PPG signals. The model was trained using ten-fold cross-validation.
Based on traditional sound feature analysis, the authors of [4] collected, quantized, preprocessed, and segmented the snoring sound signal, and then used a short-time Fourier transform to calculate the energy in different frequency bands. Snoring segments were counted against a threshold to distinguish OSAHS patients from normal subjects.
In [5], the authors proposed a machine learning-based method for detecting obstructive sleep apnea hypopnea syndrome (OSAHS) from audio signals recorded during sleep. First, the energy envelope of the audio signal is calculated and normalized by subtracting its mean value; when the result is negative, a suspected OSAHS event is detected. The respiratory and non-respiratory rates of the suspected period are then calculated to obtain six features, including respiratory event duration, respiratory energy change, average ventilation, and average energy value. These features are fed to a binary random forest classifier, and an adaptive threshold based on the patient's score distribution is used to classify the results and determine whether OSAHS is present.
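The envelope-thresholding step described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' exact implementation; the frame length, hop size, and synthetic test signal are all assumptions:

```python
import numpy as np

def detect_suspected_events(audio, frame_len=1024, hop=512):
    """Flag frames whose mean-removed energy envelope is negative,
    mirroring the screening step described in [5] (illustrative sketch)."""
    # Short-time energy envelope: sum of squared samples per frame.
    n_frames = 1 + (len(audio) - frame_len) // hop
    energy = np.array([
        np.sum(audio[i * hop:i * hop + frame_len] ** 2)
        for i in range(n_frames)
    ])
    # Normalize by subtracting the mean; negative values mark
    # low-energy (suspected apnea/hypopnea) periods.
    centered = energy - energy.mean()
    return centered < 0

rng = np.random.default_rng(0)
# Loud breathing followed by a quiet pause.
signal = np.concatenate([rng.normal(0, 1.0, 8192), rng.normal(0, 0.05, 8192)])
flags = detect_suspected_events(signal)
# Loud frames are not flagged; the quiet tail is flagged as suspected.
```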
As artificial intelligence rapidly advances, analysis methods based on deep learning have received more attention. Compared to traditional machine learning, deep learning has multiple hidden layers in addition to the input and output layers. Common deep learning frameworks include convolutional neural networks (CNN), recurrent neural networks (RNN), and long short-term memory networks (LSTM) [6]. Among them, the CNN is the most popular deep learning structure: a multi-layer neural network designed to mimic the human visual system.
In convolutional neural networks, actual performance is closely related to the designed network structure. Different convolution kernel sizes correspond to different levels of feature abstraction. Existing methods based on convolutional neural networks usually use a single convolution kernel size for feature representation, without considering the impact of features at different levels on the model. In this study, we chose a Dual-Stream Multi-Scale (DS-MS) model for the detection of respiratory pause events, with snoring signals selected to detect sleep apnea syndrome. Snoring signals are readily affected by OSAHS events, are easy to collect, and can be accurately measured by microphone devices.

2.1. Snoring Data Preprocessing
Since the original audio signal is dynamic and non-stationary, it must be preprocessed frame by frame so that each frame is short enough to be treated as stationary, providing a stable input for subsequent feature extraction. Considering that the mouth shape varies only slightly over short durations of vocalization, and to ensure enough vibration cycles per frame, a frame length of about 20 ms is usually used. Specifically, the signal is pre-emphasized and then grouped into frames of a fixed number of sampling points, with adjacent frames overlapping so that parameters transition smoothly between them. The framing operation is expressed as

y(n) = T[x(m)],

where T[·] is the framing function and x(m) is the snoring signal under test. The windowing operation then multiplies each frame by a Hamming window to reduce spectral leakage in the frequency domain:

s(n) = y(n) · w(n), 0 ≤ n ≤ N - 1, with w(n) = 0.54 - 0.46 cos(2πn / (N - 1)),

where w(n) is the Hamming window sequence of length N.
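As a concrete illustration, the framing and Hamming-windowing steps can be sketched in a few lines. The 16 kHz sampling rate, 50% overlap, and sine-wave stand-in signal are illustrative assumptions, not values from the paper:

```python
import numpy as np

fs = 16000                      # sampling rate (Hz), assumed for illustration
frame_len = int(0.020 * fs)     # 20 ms frame -> 320 samples
hop = frame_len // 2            # 50% overlap for smooth inter-frame transitions

t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 120 * t)          # stand-in for a snoring signal

# Framing: slice the signal into overlapping fixed-length frames.
n_frames = 1 + (len(x) - frame_len) // hop
frames = np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])

# Hamming window: w(n) = 0.54 - 0.46 cos(2*pi*n / (N-1)).
w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(frame_len) / (frame_len - 1))
windowed = frames * w            # broadcast the window over all frames
```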

2.2. MFCC Feature Extraction Algorithm
MFCC is a feature extraction algorithm that analyzes the frequency spectrum of snoring audio based on human auditory experiments, obtaining better snoring audio features by simulating the auditory characteristics of the human ear on the Mel frequency scale. Because Mel-frequency cepstral coefficients hardly change with pitch, they are suitable for processing the audio signals of snoring patients [7]. MFCC is a set of cepstral coefficients extracted in the Mel-scale frequency domain, where the Mel scale describes the frequency characteristics of the human ear; its relationship with frequency is given by the following equation [8]:

Mel(f) = 2595 · log10(1 + f / 700).

The signal's characteristics are difficult to observe in the time domain, so the time-domain signal is transformed into an energy distribution in the frequency domain for observation; different audio characteristics correspond to different energy distributions. After each frame is multiplied by a window, it undergoes a fast Fourier transform (FFT) to obtain the energy distribution over the spectrum, and the power spectrum of the audio signal is obtained by taking the squared modulus of the spectrum. The discrete Fourier transform (DFT) of the audio signal is:

X(k) = Σ_{n=0}^{N-1} x(n) e^{-j2πkn/N}, k = 0, 1, ..., N - 1.

The filters used are triangular filters, and the number of filters M is usually 22-26.
The triangular filter bank smooths the frequency spectrum, eliminating the influence of harmonics and highlighting the formants of the original audio. As a result, the MFCC parameters do not reflect the pitch or tone of the audio; an audio recognition system using MFCC is therefore unaffected by differences in the pitch of the input audio, which also reduces computation. The frequency response of the m-th triangular filter is defined as:

H_m(k) = 0 for k < f(m-1);
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)) for f(m-1) ≤ k ≤ f(m);
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)) for f(m) ≤ k ≤ f(m+1);
H_m(k) = 0 for k > f(m+1),

where f(m) is the center frequency of the m-th filter. The log energy of each filter bank is then calculated:

s(m) = ln( Σ_{k=0}^{N-1} |X(k)|² H_m(k) ), 0 ≤ m < M.

Finally, the MFCC coefficients are obtained by the discrete cosine transform (DCT):

C(l) = Σ_{m=0}^{M-1} s(m) cos( πl(m + 0.5) / M ), l = 1, 2, ..., L,

where L is the order of the MFCC coefficients and M is the number of triangular filters. Cepstral coefficients are defined as the inverse Fourier transform of the log spectrum, which the DCT approximates here. Furthermore, the energy (volume) of each audio frame is a crucial feature in audio analysis. Therefore, the log energy of each frame, computed as 10 times the base-10 logarithm of the frame's sum of squared samples, is usually appended as an additional dimension, so that each frame's feature vector consists of the log energy plus the cepstral coefficients.
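The per-frame MFCC pipeline described above (power spectrum, Mel-spaced triangular filter bank, log energies, DCT) can be sketched as follows. The FFT size, sampling rate, and filter count are illustrative assumptions, and real pipelines add refinements such as pre-emphasis and liftering:

```python
import numpy as np

def mfcc_frame(frame, fs=16000, n_filters=26, n_ceps=13):
    """Minimal MFCC sketch for a single windowed frame (illustrative only)."""
    n_fft = 512
    # Power spectrum |X(k)|^2 via FFT.
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2
    # Triangular filter bank with centers evenly spaced on the Mel scale.
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, c):                      # rising slope
            fbank[m - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):                      # falling slope
            fbank[m - 1, k] = (hi - k) / max(hi - c, 1)
    # Log filter-bank energies, then a DCT-II to get cepstral coefficients.
    log_e = np.log(fbank @ power + 1e-10)
    m_idx = np.arange(n_filters)
    return np.array([np.sum(log_e * np.cos(np.pi * l * (m_idx + 0.5) / n_filters))
                     for l in range(n_ceps)])

rng = np.random.default_rng(1)
frame = rng.normal(size=320) * np.hamming(320)   # one windowed frame
c = mfcc_frame(frame)                            # 13 cepstral coefficients
```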

2.3. Convolutional Neural Network
A Convolutional Neural Network (CNN) is a deep learning architecture that draws inspiration from the human visual system. Compared to other deep learning architectures, CNNs typically perform better on image and speech data. This is attributed to their improved capability to learn higher-level features by leveraging multiple hidden layers, convolutional operations, and well-designed training data. Additionally, compared to traditional manual feature extraction, a CNN is capable of discovering complex intrinsic correlated features in high-dimensional data [9]. A typical CNN comprises input and output layers along with convolutional layers, activation layers, and pooling layers. The convolutional layer is the most important part of the CNN; features are extracted by applying convolution operations to the input data. For a given input X, the convolution can be expressed as:

Y(i, j) = Σ_m Σ_n X(i + m, j + n) · W(m, n),

where W is the convolution kernel; if X is a two-dimensional matrix, then W is also a two-dimensional matrix. The same convolution kernel is applied at every position of the input X, i.e., the weights are shared. The activation layer is typically placed after a convolutional layer or fully connected operation. Its purpose is to introduce nonlinearity into the network, allowing it to approximate arbitrary nonlinear functions. Common activation functions include the sigmoid function, the hyperbolic tangent function (tanh), and the rectified linear unit (ReLU), defined as follows:

sigmoid(x) = 1 / (1 + e^(-x)), tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)), ReLU(x) = max(0, x).

The pooling layer, also known as the subsampling layer, downsamples the features obtained from the convolutional layer, reducing the spatial dimensions of the feature maps. The operation is similar to the convolutional layer, but the pooling kernel takes only the maximum or average value at the corresponding positions. This reduces the computation of the convolutional neural network and alleviates overfitting to some extent. Common pooling layers include average pooling and max pooling.
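A minimal sketch of these three building blocks: a "valid" convolution with a shared kernel, a ReLU activation, and non-overlapping max pooling. The toy input and kernel are chosen only for illustration:

```python
import numpy as np

def conv2d_valid(x, w):
    """'Valid' 2-D cross-correlation with a shared kernel (weight sharing):
    the same w slides over every position of x."""
    H, W_ = x.shape
    kh, kw = w.shape
    out = np.empty((H - kh + 1, W_ - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

def relu(x):
    """ReLU(x) = max(0, x), applied elementwise."""
    return np.maximum(0.0, x)

def max_pool2d(x, k=2):
    """Non-overlapping k x k max pooling (feature-map downsampling)."""
    H, W_ = x.shape
    x = x[:H - H % k, :W_ - W_ % k]
    return x.reshape(H // k, k, W_ // k, k).max(axis=(1, 3))

x = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 input "image"
w = np.array([[-1.0, 0.0], [0.0, 1.0]])        # toy diagonal-difference kernel
feat = max_pool2d(relu(conv2d_valid(x, w)))    # (6,6) -> (5,5) -> (2,2)
```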

2.4. DS-MS Neural Network Model
This article utilizes a DS-MS convolutional neural network to detect sleep apnea.Compared to conventional CNNs, the DS-MS CNN uses multiple convolutional kernels instead of a single one.The underlying idea is that different scales of convolution can extract features of different abstraction levels in deep neural networks.
The DS-MS convolutional neural network structure used in this study is illustrated in Figure 1. The DS-MS model inputs snoring signals into two independent CNNs. One network analyzes the time-series information of the snoring signal, while the other network analyzes the frequency spectrum information. With multi-scale convolution operations, the DS-MS CNN can analyze snoring signals of different frequencies to better identify breathing pauses. Additionally, the DS-MS model can fuse the feature representations of the two networks to obtain more accurate sleep apnea detection results.
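At the shape level, the dual-stream idea can be sketched as two convolutional branches whose pooled features are concatenated before a classifier head. Everything below (kernel sizes, random weights, crude global pooling, the two-class softmax head) is an illustrative simplification, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(42)

def branch(x, kernels):
    """One stream: stacked 1-D 'valid' convolutions with ReLU, then
    crude global pooling to a fixed-length feature vector."""
    for k in kernels:
        w = rng.normal(scale=0.1, size=k)          # random illustrative weights
        x = np.maximum(0.0, np.convolve(x, w, mode="valid"))
    return np.array([x.mean(), x.max()])           # global average + max pooling

waveform = rng.normal(size=1000)                   # time-domain stream input
spectrum = np.abs(np.fft.rfft(waveform))           # frequency-domain stream input

f_time = branch(waveform, kernels=[10, 3, 3])      # temporal branch
f_freq = branch(spectrum, kernels=[3, 3])          # spectral branch

# Feature fusion: concatenate both streams, then a linear classifier head.
fused = np.concatenate([f_time, f_freq])
W_out = rng.normal(scale=0.1, size=(2, fused.size))
logits = W_out @ fused
probs = np.exp(logits) / np.exp(logits).sum()      # softmax over {apnea, normal}
```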

3.1. Experimental Settings
The database used in this study is the St. Vincent's University Hospital / University College Dublin Sleep Apnea Database from PhysioNet. The database contains 25 full-night polysomnography recordings from adult subjects with sleep apnea. Sleep stages were scored by experienced sleep technicians according to the standard Rechtschaffen and Kales criteria.

3.2. Experimental Model Setup
In the sleep apnea detection task, the normalized training set was fed into the DS-MS neural network model for training. The temporal branch consisted of three convolutional layers with 16 convolutional kernels, with kernel sizes of 10, 3, and 3 and a stride of 2. The three max-pooling layers had a pooling kernel size of 3 and a stride of 2. The frequency branch consisted of two convolutional layers, both with a kernel size of (3, 3) and a stride of (1, 1). The two max-pooling layers had a pooling kernel size of (2, 2) and a stride of (2, 2).
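For layers without padding, the output length of each convolution or pooling operation follows out = floor((in - k) / s) + 1. The temporal branch's shape bookkeeping can be checked with a few lines; the input length of 1600 samples is an assumed example, not a value from the paper:

```python
def out_len(n, k, s):
    """Output length of a 'valid' (no padding) conv or pool layer:
    floor((n - k) / s) + 1."""
    return (n - k) // s + 1

n = 1600                      # assumed example input length (samples)
# Temporal branch: conv (stride 2) followed by max-pool (size 3, stride 2), x3.
for k in (10, 3, 3):
    n = out_len(n, k, 2)      # convolution with kernel size k
    n = out_len(n, 3, 2)      # max pooling with kernel size 3
# n now holds the feature length entering the fusion stage.
```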
In deep learning, the training process is fully supervised via the backpropagation algorithm. Based on the Adam update rule, the model parameters are optimized by minimizing the cross-entropy loss function.
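A minimal sketch of the cross-entropy loss being minimized; the class probabilities and labels below are toy values for illustration:

```python
import numpy as np

def cross_entropy(probs, labels):
    """Mean cross-entropy for integer class labels.
    probs: (N, C) predicted class probabilities; labels: (N,) class indices."""
    n = probs.shape[0]
    return -np.mean(np.log(probs[np.arange(n), labels] + 1e-12))

# Two samples, two classes (apnea vs. normal); both predictions are correct,
# so the loss is small.
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
labels = np.array([0, 1])
loss = cross_entropy(probs, labels)
```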
To validate the proposed method effectively, three commonly used metrics were used: accuracy, sensitivity, and specificity, defined as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN),
Sensitivity = TP / (TP + FN),
Specificity = TN / (TN + FP),

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.
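These three metrics follow directly from confusion-matrix counts; a small sketch with illustrative counts (not the paper's experimental numbers):

```python
def evaluate(tp, tn, fp, fn):
    """Accuracy, sensitivity (recall on apnea events), and specificity
    from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return accuracy, sensitivity, specificity

# Illustrative counts only.
acc, sen, spe = evaluate(tp=82, tn=90, fp=10, fn=18)
```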

3.3. Experimental Results
The performance of DS-MS and a conventional single-scale CNN was compared. As shown in Figure 2, the two methods have similar specificity, but DS-MS outperforms the conventional CNN in terms of accuracy and sensitivity. Overall, DS-MS can further improve the performance of convolutional neural networks.
From Figure 3, it is evident that our proposed method outperforms previous studies. In [10], a combination of HMM and DNN achieved 84.7%, 88.9%, and 82.1% for accuracy, sensitivity, and specificity, respectively, all lower than our proposed method. A traditional SVM method achieved 91.29% accuracy and 89.84% specificity, but it involved complex signal processing, extracting more than a dozen features spanning time-domain, frequency-domain, and nonlinear features. Compared with such traditional machine learning classification methods, our approach is more convenient and has the potential to serve as a preliminary screening tool for patients with sleep apnea syndrome.

4. CONCLUSIONS
In this study, a deep learning model, DS-MS, was designed to detect respiratory events based on differences in how the input snoring signals are processed. The PhysioNet database was used for validation, and the proposed method achieved an accuracy of 94.17%, a sensitivity of 87.23%, and a specificity of 86.51%. These results confirm the effectiveness of our method. The present work is a retrospective study; in the future, we hope to explore the feasibility of real-time OSAHS detection algorithms based on this work.