Deep learning algorithm for visual quality assessment of the spirograms

Objective. The quality of spirometry manoeuvres is crucial for correctly interpreting the values of spirometry parameters. A fundamental guideline for proper quality assessment is the American Thoracic Society and European Respiratory Society (ATS/ERS) Standards for spirometry, updated in 2019, which describe several start-of-test and end-of-test criteria that can be assessed automatically. However, the spirometry standards also require a visual evaluation of the spirometry curve to determine the spirograms' acceptability or usability. In this study, we present an automatic algorithm based on a convolutional neural network (CNN) for quality assessment of spirometry curves as an alternative to manual verification performed by specialists. Approach. The algorithm for automatic assessment of spirometry measurements was created using a set of 1998 randomly selected spirograms which met all quantitative criteria defined by the ATS/ERS Standards. Each spirogram was annotated as 'confirm' (retaining its acceptable or usable status) or 'reject' (changing the status to unacceptable) by four pulmonologists, separately for the FEV1 and FVC parameters. The database was split into a training (80%) and test set (20%) for developing the CNN classification algorithm. The algorithm was optimised using a cross-validation method. Main results. The accuracy, sensitivity and specificity obtained for the algorithm were 92.6%, 93.1% and 90.0% for FEV1 and 94.1%, 95.6% and 88.3% for FVC, respectively. Significance. The algorithm provides an opportunity to significantly improve the quality of spirometry tests, especially during unsupervised spirometry. It can also serve as an additional tool in clinical trials to quickly assess the quality of a large group of tests.


Introduction
Poorly performed spirometry greatly increases the risk of misinterpreting results. Such a statement was given by P Enright as the first sentence of his excellent instructional article for spirometry practitioners in 2003 (Enright 2003). Misinterpreted spirometry may lead to underdiagnosis or overdiagnosis of obstructive pulmonary disease, leading in turn to worse outcomes or drug overuse. Since that time, nothing has changed in this matter, including the use of exaggerated body language to demonstrate phases of the spirometry manoeuvre during patient coaching in clinical care. Spirometry, a crucial diagnostic tool to identify and monitor diseases such as asthma, cystic fibrosis, and chronic obstructive pulmonary disease, is the most common pulmonary function test. Progressive improvement in healthcare quality worldwide, an ageing society, and growing awareness of the impact of environmental pollution on lung diseases are causing increasing demand for spirometry testing. For several reasons (limited funding, geographical accessibility, and the long medical training pathway), this increase in demand cannot be sufficiently followed by the growing number of experienced physicians available to perform the tests. In clinical trials, where a large number of examinations are evaluated by a single research group, the support of technicians through (at least partly) automated systems is also of significant value (Malmstrom et al 2002).
In the last decade, several types of portable spirometers have been placed on the market and in clinical and primary care, broadening spirometry testing availability and supporting the notion that quality testing is possible with only limited staff training (Hegewald et al 2016, Carpenter et al 2018). Although the adoption of spirometry in primary care has improved over time, the full potential of office spirometry has still not been realised (Ruppel et al 2018). On the other hand, it has been shown that telemedicine may be an effective tool to enhance the quality of forced spirometry in primary care (Burgos et al 2012).
Appropriate assessment of the quality of spirometry manoeuvres in general, and of the flow-volume curve from the forced manoeuvre in particular, is crucial for properly interpreting the values of spirometry parameters. A key guideline for such a quality assessment is the American Thoracic Society and European Respiratory Society (ATS/ERS) standards, updated in 2019 (Graham et al 2019). The Standards describe several quantitative (e.g. back extrapolated value threshold) and qualitative (e.g. no leak, no cough, no glottic closure) conditions for determining the technical acceptability of the manoeuvre, considering forced expiratory volume in the first second (FEV1) and forced vital capacity (FVC) separately. The former (quantitative) criteria are straightforward to implement in a simple automatic algorithm, while the latter (qualitative) criteria are not accompanied by required threshold values (nor technical parameters) and, as such, need to be evaluated visually. Naturally, for some examinations this leads to ambiguous assessments, where several experts may hold different opinions on the same test. It has been shown that the numerical quality criteria are highly unlikely to successfully replace visual inspection of the flow-volume curve (Müller-Brandes et al 2014). Recent developments, however, indicate that visual inspection performed by the physician may be replaced or, at least, supported by machine learning (ML) solutions.
The first attempts at automatic forced spirometry curve assessment took place a decade ago and were based on an analytical-empirical approach to the curve shape (Melia et al 2014). At the same time, another study revealed that good-quality spirometry tests are attainable by trained research assistants irrespective of their prior experience in spirometry (Tan et al 2014), indirectly indicating that quality grading may be a suitable task for ML. In 2018, a supervised learning method was proposed, based on training several classifiers on previously labelled forced spirometry tests (Velickovski et al 2018). In that study, unique coefficients representing the shape of the spirogram were calculated and used as input to the ML solutions. In 2020, it was proposed that a deep learning algorithm may help to standardise the ATS/ERS spirometry acceptability and usability criteria (Das et al 2020). The authors constructed a convolutional neural network (CNN) trained on a historical dataset of curves assessed by technicians. They reported that their classifier was superior to ATS/ERS quantifiable rule-based methods (the update from 2005 was considered). At the same time, such a result showed that technicians do not strictly apply the quantitative ATS/ERS criteria.
Recently, it has been demonstrated in a retrospective study that ML software may be helpful in clinical trials to promptly ensure the consistency of spirometry quality evaluation in comparison to human over-readers (Topole et al 2021). In a very recent study (Wang et al 2022), spirometry PDF files collected during hospital examinations were labelled according to the ATS/ERS 2019 criteria, and a deep learning model was constructed to determine acceptability, usability, and a quality rating for FEV1 and FVC. In their solution, the authors proposed a model containing a rule module for the numerical examination of the ATS/ERS quantitative criteria and an object detection module to analyse curve images. The output of the automatic assessment model combines information from the numerical criteria and the deep learning model, resulting in acceptable/usable/incorrect labels and letter grading. In 2019, it was reported that an ML-based solution outperformed clinical pulmonologists in interpreting pulmonary function tests (Topalovic et al 2019). However, significant concerns of spirometry practitioners were related to the data selection for the test set and the representation of the population in the training set (Gonem and Siddiqui 2019). On the other hand, the low success rate of correct diagnosis by human pulmonologists in that study revealed an urgent need for computer-aided interpretation in this domain (Delclaux 2019, Gillissen 2019). The study confirmed that several pulmonologists might interpret the same pulmonary tests differently. Moreover, it revealed that pulmonologists do not always apply the ATS/ERS criteria for test interpretation, which supports the idea of implementing automatic interpretation in diagnostic systems (van de Hei et al 2020). A separate field of recent development is quality assessment in so-called sound spirometry, i.e. spirometry without the supervision of a physician, where the diagnostic signal is sound-recorded through a mobile phone (Pinho et al 2018, Viswanath et al 2018, Almeida et al 2020, Rahman et al 2020) or a tablet microphone (Almeida et al 2019). Smartphone spirometry is particularly susceptible to poorly performed efforts because of environmental sounds or mistakes in the effort. In sound-based spirometry, audio features are derived using standard audio-processing techniques (e.g. spectrograms) and then analysed by ML classifiers. Although the authors report acceptable/satisfactory performance of their quality assessment solutions (for instance, 93.3% of high-fidelity recordings were recognised in Rahman et al (2020)), sound-based solutions are still broadly regarded as pre-screening methods rather than clinical systems, due to the lack of evidence for clinical precision.
The aim of our study is to create an algorithm for robust assessment of spirogram quality, based on an ML classification model trained on the experts' references. The algorithm was developed to support physicians in quality evaluation and to replace manual quality assessment when the examination is performed by the patient without supervision. However, the aim of the algorithm is not to replace the quantitative criteria defined by the 2019 ATS/ERS Standards (in contrast to the reference solution, which, de facto, replaces these Standards (Wang et al 2022)) but to review those spirograms that met them in the automatic evaluation, based on the threshold values for such parameters as the back extrapolated value (BEV), time to PEF (TPEF), etc.

Study design
In this retrospective study, we present and evaluate an algorithm for the automatic quality assessment of spirometry spirograms based on a CNN. A basic assumption of the present algorithm is that the ATS/ERS numerical (quantitative) criteria are always applied rigorously to the spirometry curves. Thus, a curve labelled as incorrect due to a numerical criterion (e.g. BEV) is not further considered in the study (contrary to the prior art: Velickovski et al 2018, Wang et al 2022). Among the qualitative ATS/ERS criteria (e.g. a leak, a mouthpiece obstruction, glottic closure, cough), cough detection has been implemented as a standalone AI algorithm described previously (Soliński et al 2020). As a result, the data used to construct the ML quality assessment contained only those curves labelled as correct (acceptable or usable). The deep learning model is intended to replace the need for deeper visual examination by a highly experienced physician. Such further visual examination may result in the need to repeat the manoeuvre despite the fulfilment of the numerical criteria.
Considering the above, the present solution may be understood as a tool to assess the least obvious cases on the borderline between correct and incorrect spirograms, including cases where several physicians may disagree on the final quality assessment. Each spirogram was visually inspected by a group of four experienced pulmonologists (later called Experts) using a web-based panel developed for the study. Examples of spirograms reviewed by the experts are shown in figure 1. According to ATS/ERS 2019, spirograms were assessed considering the acceptability and usability of the FEV1 and FVC parameters separately. Each expert was asked to select one of two states for each parameter: 'confirm' or 'reject'. By choosing 'confirm', the expert approved the previous assessment of the spirometry (either usable or acceptable); by choosing 'reject', the assessment was downgraded to incorrect. All spirograms, flow-volume curves, and numerical results were displayed for each manoeuvre. Finally, FEV1 or FVC was set as confirmed/rejected when at least 3 experts made concordant decisions. Spirograms with an equal number of confirmed and rejected decisions (2:2) were excluded from the analysis. The schematic workflow of the quality assessment process, including the CNN algorithm, is presented in figure 2.
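The consensus rule described above (at least 3 of 4 concordant expert decisions, with 2:2 ties excluded) can be sketched as follows; the function name and vote encoding are illustrative, not taken from the study:

```python
from collections import Counter

def consensus_label(votes):
    """Aggregate per-expert decisions ('confirm'/'reject') for one
    parameter (FEV1 or FVC) of one spirogram.

    Returns the majority label when at least 3 of the 4 experts agree,
    or None for a 2:2 tie (such spirograms are excluded from analysis).
    """
    label, n = Counter(votes).most_common(1)[0]
    return label if n >= 3 else None
```

A curve is then kept in the dataset for a given parameter only when `consensus_label` returns a non-None value.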
Data

Data source

All spirometry spirograms were obtained from the AioCare system (HealthUp, Poland), consisting of a portable spirometer, a mobile application and a data storage cloud. The spirometer is a class IIa medical device that gathers the airflow signal, which is further transmitted to the mobile application for comprehensive analysis. The system presents clinically important parameters to the user, such as (but not limited to) FEV1, FVC, FEV1/FVC, or PEF. The correctness of the examination and the potential need to repeat the measurement are evaluated in the system in real time. The research was carried out following the principles of the Declaration of Helsinki.

Reference data
The model was trained and evaluated using 1998 randomly selected spirograms from the AioCare database, collected during two clinical, cross-sectional studies (Dąbrowiecki et al 2021, Jankowski et al 2021). The first study aimed to evaluate the use of mobile spirometry in a primary care setting; spirometry tests were performed on the general population at primary health centres across Poland (673 operators) (Jankowski et al 2021). The second study was conducted among preadolescent children in two cities in Poland characterised by different pollution levels (Dąbrowiecki et al 2021). In both studies, spirometry tests were performed under the supervision of trained technicians and general practitioners. The selection criteria were constructed to cover the maximal range of FEV1 and FVC while simultaneously retaining uniform distributions for these parameters.
The histograms of FEV1 and FVC values are presented in figure 3. Table 1 shows the population characteristics of the selected dataset.
The results of manual annotation showed moderate agreement between experts. The kappa coefficient for FEV1 and FVC equalled 69.8% and 71.2%, respectively. The number of confirmed classifications for each expert and parameter, as well as detailed agreement results between the experts, are presented in table 2. After rejecting the spirograms with no agreement between at least 3 Experts, a set of 1952 (1655 confirmed + 297 rejected) curves for FEV1 and 1962 (1578 confirmed + 384 rejected) curves for FVC was used for the development of the algorithm. Data were randomly split into 80% training and 20% testing sets, separately for the FEV1 and FVC parameters (table 3). To balance the training sets, every curve labelled as rejected appeared 5 times in the training set for FEV1 and 3 times for FVC.

Figure 1. Examples of spirograms selected for the study. A: good-quality graded spirogram; B: bad quality (due to glottic closure at 1 s); C: bad quality (due to irregular exhale curve).
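The two data-preparation steps mentioned above, inter-rater agreement and oversampling of rejected curves, can be sketched as below; the function names and the list-based data layout are our illustrative assumptions:

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two raters over the same binary labels."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    # expected agreement under independent marginal label frequencies
    pe = sum((a.count(c) / n) * (b.count(c) / n) for c in set(a) | set(b))
    return (po - pe) / (1 - pe)

def oversample(curves, labels, reject_factor):
    """Balance a training set by repeating every 'reject' curve
    reject_factor times (5 for FEV1, 3 for FVC in the study)."""
    out = []
    for curve, label in zip(curves, labels):
        out += [(curve, label)] * (reject_factor if label == 'reject' else 1)
    return out
```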

Classification algorithm (CNN)
The input to the classification algorithm was the flow-time signal. Flow-time data contain all the information present in the volume-time and flow-volume curves, as both are derived from the flow-time signal. All flow-time signals were downsampled from 100 to 50 Hz. For both FEV1 and FVC quality classification, all 750 samples of the downsampled data were used (15 s). For signals shorter than 15 s, zeroes were appended to the end of the input (zero-padding).
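The preprocessing described above can be sketched as follows; the text does not specify the decimation method, so plain stride-2 subsampling is assumed here:

```python
import numpy as np

RAW_HZ, TARGET_HZ, MAX_SECONDS = 100, 50, 15
N_SAMPLES = TARGET_HZ * MAX_SECONDS  # 750 samples = 15 s at 50 Hz

def preprocess(flow_100hz):
    """Downsample a raw 100 Hz flow-time signal to 50 Hz and
    zero-pad (or truncate) it to a fixed 750-sample input."""
    x = np.asarray(flow_100hz, dtype=np.float32)[::RAW_HZ // TARGET_HZ]
    out = np.zeros(N_SAMPLES, dtype=np.float32)
    n = min(len(x), N_SAMPLES)
    out[:n] = x[:n]
    return out
```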
The architecture of our models is shown in figure 4. We used a one-dimensional CNN for flow-time signals. The architecture consisted of 3 convolutional layers followed by a 4-layer perceptron. Each convolutional layer contains a 1D convolution operation followed by batch normalisation and a ReLU activation function. The last convolutional layer is additionally followed by adaptive max pooling over time with an output size of 1. In the three blocks, 128, 256 and 128 convolution filters were used, respectively. The sizes of the convolution kernels were 50, 10 and 5 for FEV1 classification. For FVC, the kernel sizes were 80, 40 and 5, which is the only difference between the FEV1 and FVC classification architectures (kernel sizes were determined during the prototyping stage to maximise performance on the validation set). The numbers of neurons in the fully connected layers were 512, 512, 512 and 1, where 1 is the output neuron for binary classification. For the hidden layers, ReLU was used as the activation function. During development, we also validated some reduced versions of this architecture (with no perceptron layers, fewer channels, and combinations thereof), but the preliminary results indicated worse performance in comparison to the final model. Each fully connected layer is preceded by dropout with a rate equal to 0.1, 0.2, 0.2 and 0.3, respectively. A sigmoid function was applied to the output value of the last fully connected layer. The network was trained for 20 epochs for the FEV1 parameter and 10 epochs for the FVC parameter. The mini-batch size was 256 for both parameters. Parameter optimisation was carried out using a negative log-likelihood loss function and the ADAM optimiser. The ADAM parameters were: learning rate = 0.001 and epsilon = 0.1 for FEV1, and learning rate = 0.0005 and epsilon = 0.0001 for FVC.

(Table 2 note: full agreement is when all 4 experts gave the same opinion on the curve; majority voting is agreement between at least 3 Experts.)
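The FEV1 variant of the described architecture can be reconstructed in PyTorch roughly as below; details the text leaves open (convolution padding and stride, the exact placement of pooling and dropout) are our assumptions:

```python
import torch
import torch.nn as nn

class SpiroQualityCNN(nn.Module):
    """Sketch of the described 1D CNN: three Conv1d blocks
    (convolution -> batch norm -> ReLU) with 128, 256 and 128 filters,
    adaptive max pooling over time after the last block, then a 4-layer
    perceptron (512, 512, 512, 1) with dropout and a sigmoid output.
    kernel_sizes is (50, 10, 5) for FEV1 and (80, 40, 5) for FVC."""

    def __init__(self, kernel_sizes=(50, 10, 5)):
        super().__init__()
        channels = (1, 128, 256, 128)
        blocks = []
        for c_in, c_out, k in zip(channels, channels[1:], kernel_sizes):
            blocks += [nn.Conv1d(c_in, c_out, k), nn.BatchNorm1d(c_out), nn.ReLU()]
        blocks.append(nn.AdaptiveMaxPool1d(1))  # (batch, 128, T) -> (batch, 128, 1)
        self.features = nn.Sequential(*blocks)
        self.classifier = nn.Sequential(
            nn.Dropout(0.1), nn.Linear(128, 512), nn.ReLU(),
            nn.Dropout(0.2), nn.Linear(512, 512), nn.ReLU(),
            nn.Dropout(0.2), nn.Linear(512, 512), nn.ReLU(),
            nn.Dropout(0.3), nn.Linear(512, 1),
        )

    def forward(self, x):  # x: (batch, 1, 750) flow-time signals
        h = self.features(x).squeeze(-1)
        return torch.sigmoid(self.classifier(h))
```

Passing `kernel_sizes=(80, 40, 5)` yields the FVC variant, the only architectural difference between the two models.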
The optimal decision thresholds in the output layer were estimated using ROC curve analysis and equalled 0.531 for FEV1 and 0.368 for FVC. The optimal architecture and hyperparameters (learning rate, number of epochs, batch size, kernel size, number of kernels, activation functions, number of layers and neurons) of the neural networks were tuned using 5-fold cross-validation on the data from the training set. The model was developed and implemented in PyTorch 1.8.1. The model's training was performed on an NVIDIA GeForce GTX 1650 Ti GPU. It took about 5 min to train one model.
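One common way to pick such an operating point from the ROC curve is to maximise Youden's J statistic; the text does not state which ROC criterion was used, so this sketch is only one plausible reading:

```python
def best_threshold(scores, labels):
    """Choose the decision threshold maximising Youden's J
    (sensitivity + specificity - 1) over the observed scores.
    labels: 1 = confirmed by experts, 0 = rejected."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    best_t, best_j = None, float('-inf')
    for t in sorted(set(scores)):
        sensitivity = sum(s >= t for s in pos) / len(pos)
        specificity = sum(s < t for s in neg) / len(neg)
        j = sensitivity + specificity - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t
```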

Performance metrics
The performance of the algorithm was evaluated using metrics such as accuracy, sensitivity and specificity, calculated from the numbers of true/false cases. We defined a true positive case as one in which the classification algorithm confirmed the measurement of FEV1 or FVC and the experts also confirmed it; in other words, when the algorithm correctly classified a good-quality measurement. An additional performance measure was the ROC curve, which allows the calculation of the area under the curve (AUC).
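With that positive-class convention (a confirmed, good-quality measurement), the reported metrics follow directly from the confusion counts; the count values used below are purely illustrative:

```python
def metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity and specificity from confusion counts,
    where 'positive' means a measurement confirmed as good quality."""
    return {
        'accuracy': (tp + tn) / (tp + fp + tn + fn),
        'sensitivity': tp / (tp + fn),  # confirmed manoeuvres correctly kept
        'specificity': tn / (tn + fp),  # rejected manoeuvres correctly caught
    }
```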

Results
Confusion matrices for both parameters are shown in table 4. They contain the data from the test set. Based on these results, we calculated the accuracy, sensitivity, specificity and AUC, which were equal to 92.6%, 93.1%, 90.0% and 0.948 for the FEV1 parameter and 94.1%, 95.6%, 88.3% and 0.976 for the FVC parameter. The performance metrics with binomial proportion confidence intervals (95% CI) are shown in table 5. ROC curves are presented in figure 5.

Discussion
The results presented in the previous section need to be interpreted carefully, as many factors affect the study. The accuracy, sensitivity and specificity results for FEV1 and FVC are all (with one exception) above 90%, with a high AUC (0.98 for FVC). The AUC values were also high in age-related sub-groups. This may be regarded as a satisfying level, taking into account the limited agreement between the data-labelling experts (no more than 83% of full-agreement cases; kappa coefficient 0.70 for FEV1 and 0.71 for FVC). From this point of view, the solution reflects the application of expert knowledge. Standardisation of knowledge and overall clinical experience is a key goal in healthcare worldwide and one of the main benefits of AI in the medical field. Standardisation reduces variation, makes treatment more predictable and reliable, and decreases iatrogenic errors. Many studies have pointed out the high dependence of spirometry quality on the technician's experience, a low agreement between usefulness in clinical practice and the ATS/ERS criteria, and a low agreement between the experts themselves (Burton et al 2004, Landman et al 2011, van de Hei et al 2020, Jankowski et al 2021). In 2020, van de Hei, in a real-life study, found high clinical usefulness (>88%) of spirometry performed in primary care, even though only 13% of tests met the ATS/ERS criteria (van de Hei et al 2020).
We should remember that a single spirometry examination is only a part of the broader clinical picture of the patient and only a step on the diagnosis and treatment ladder. Therefore, further studies relying on supervised ML techniques should include an investigation of the factors influencing the experts' decisions, not only in terms of adherence to criteria but also in terms of the clinical setting and the experts' backgrounds, because the approach to usability may vary between specialists and particular cases.
The sensitivity of our algorithm is higher than its specificity, which we found acceptable, as the two values are still similar. If minimising the false-negative rate were the property of interest due to the functional needs of the application (to prevent the unnecessary rejection of manoeuvres), the final classification threshold in the output layer of the CNN could be optimised for that purpose.
At this point, we must relate our research to the preceding studies. The closest prior art we have identified is (Velickovski et al 2018, Wang et al 2022). There are several similarities and several differences between those studies and ours. The first issue is the preparation of the data. In Wang et al (2022), a different labelling scheme was used, as each sample was assessed by a single expert only. Thus, there are no agreement metrics and, therefore, no cross-information on the labelling quality. The second significant difference is the selection of data as input to the ML algorithm and the purpose of the operation. In our case, the algorithm takes as input only those samples which were previously positively screened against the numerical ATS/ERS quality criteria (and for cough, as previously reported (Soliński et al 2020)). In the case of the solution presented in Wang et al (2022), the algorithm is intended to replace the whole quality evaluation, including the analysis of samples which could previously have been rejected by the numerical ATS/ERS criteria. A similar idea, to replace or re-standardise the acceptability criteria using ML solutions, was also researched in Das et al (2020), where the authors defined their own letter grades.
As disclosed in the appendix of Wang et al (2022), a significant number of the samples (24.7% for the external test set) classified as incorrect could be classified as such due to the quantitative criteria (BEV, rise time, unsatisfied EOFE), which may result from, e.g. insufficient patient training or effort. Therefore, the algorithm in Wang et al (2022) seems to be run on 'easier' data. Since Wang et al (2022) aims to replace the ATS/ERS criteria instead of supporting them, as in our case, it presents separate results for acceptability and usability. This means the comparison of results between the studies is no longer straightforward.
An undoubted advantage of our algorithm is its high AUC (0.948 for FEV1 and 0.976 for FVC), in contrast to the study of Velickovski et al (0.88, (Velickovski et al 2018)). The AUC for Wang et al (2022) has not been disclosed. In relation to Velickovski et al (2018), our study, as well as Wang et al (2022), showed that classifiers based on convolutional solutions outperform those based on shape-derived parameters as used in Velickovski et al (2018) (namely, coefficients of polynomials fitted to the ascending and descending parts of the forced manoeuvre flow signal). One may observe that applying a CNN instead of analytical parameters leads to better exploitation of the information in the signal, as a fitting error unavoidably accompanies any fitting, and minor disturbances in the signal are prone to being ignored entirely in the analysis. Further consideration shows that agreement between experts in Velickovski et al (2018) was relatively low (the maximal kappa between experts was 0.53, compared with 0.70 for FEV1 and 0.71 for FVC in our case), which is a significant observation, as lower-quality input data can be expected to yield lower performance. In addition to the shape features, the BEV and FET spirometry parameters were also given as input to the tested classifiers. In the case of a CNN, we considered supplying those parameters redundant, considering that any correlation between BEV or FET and the classification result would be reflected in the respective convolution layers.
Much effort was devoted to making the dataset diverse. The spirometry manoeuvres used to create the algorithm covered a wide range of FEV1 and FVC values.
The relatively high agreement between the experts was due to their vast experience in the field. Additionally, a meeting with the experts was held before the actual quality assessment process. The meeting agenda included initial training and reaching a consensus on the quality assessment criteria, based on a general discussion and the joint assessment of a few complex cases.
Our study also had some limitations. The algorithm output is binary (rejected/confirmed) and does not provide a specific reason for the manoeuvre's rejection. The algorithm could be extended with an explanatory (saliency-based) method in the future. The dataset did not provide information about the health status of the patients. In addition, the decision is based on a single manoeuvre, without information about the patient or their previous attempts. Also, the experts' assessment was limited because they did not see the patient or the spirometer during the manoeuvre. We used separate models for FEV1 and FVC, although other approaches, such as Task Adaptive Parameter Sharing, are possible and may be explored in further work (Wallingford et al 2022).

Conclusion
Many approaches are used to provide guidance and support for spirometry interpretation and quality control. In this study, based on the experience of four independent pulmonologists, we trained and evaluated a CNN which classifies FEV1 and FVC independently as acceptable/usable or unacceptable according to the ATS/ERS 2019 criteria, with high AUC values of 0.948 and 0.976, respectively. The presented model can be implemented in clinical practice with high efficiency and a reduction of variation in spirometry quality interpretation. However, it is important to note that a supervised model is only as strong as the experience of, and agreement between, the experts labelling the data. Therefore, further efforts should focus on the standardisation of experts' knowledge, including the patient's clinical picture, and on its implementation in ML-based decision support systems.

Data availability statement
The data cannot be made publicly available upon publication because they contain sensitive personal information. The data that support the findings of this study are available upon reasonable request from the authors.