Real-time breath analysis towards a healthy human breath profile

The direct analysis of molecules contained within human breath has had significant implications for clinical and diagnostic applications in recent decades. However, attempts to compare one study to another or to reproduce previous work are hampered by: variability between sampling methodologies, human phenotypic variability, complex interactions between compounds within breath, and confounding signals from comorbidities. Towards this end, we have endeavored to create an averaged healthy human ‘profile’ against which follow-on studies might be compared. Through the use of direct secondary electrospray ionization combined with high-resolution mass spectrometry and an in-house bioinformatics pipeline, we seek to curate an average healthy human profile for breath and use this model to distinguish differences inter- and intra-day for human volunteers. Breath samples were significantly different in PERMANOVA analysis and ANOSIM analysis based on Time of Day, Participant ID, Date of Sample, Sex of Participant, and Age of Participant (p < 0.001). Optimal binning analysis identified strong associations between specific features and variables. These include 227 breath features identified as unique identifiers for 28 of the 31 participants. Four signals were identified to be strongly associated with female participants and one with male participants. A total of 37 signals were identified to be strongly associated with the time of day at which samples were taken. Threshold indicator taxa analysis indicated a shift in significant breath features across the age gradient of participants with peak disruption of breath metabolites occurring at around age 32. Forty-eight features were identified after filtering from which a healthy human breath profile for all participants was created.


Introduction
Human breath analysis has received considerable attention in the last century as a diagnostic tool for human health. By the early 1950s, a patented, field-tested 'Breathalyzer' had already been developed and would become the de facto standard for breath alcohol analysis for decades [1]. Upon the arrival of gas chromatography (GC) many studies were published in the 1960s on breath analysis, although these tended to focus on only a few abundant components of breath, such as acetone, methanol, ethanol, and isoprene [2][3][4][5]. Most studies agreed that pre-concentration was necessary to attempt to probe low abundance analytes within breath. In the early 1970s, Pauling et al created one of the first 'cold traps' to be used to capture breath and urine vapor for GC analysis. Although they erroneously attempted to eliminate gut microbiota altogether as a confounding factor, their analysis nevertheless reported a stunning 250 different compounds within breath condensate [6]. This was the first time the true complexity of breath had been recognized. To date nearly 1500 unique compounds have been reported in breath [7]. However, many of these compounds have been only tentatively identified and/or have only been reported in the literature once. Indeed, although the major gaseous constituents of breath are well-characterized, the identity, origin, and role of many of these relatively low-abundance compounds contained within breath are less well understood [8].
The analysis of breath has evolved over the years, and today there are many analytical options available. Generally, most techniques deal with either exhaled breath condensate (EBC) or utilize direct in situ breath measurements. EBC consists mainly of water-soluble volatiles and some non-volatiles that have been flash frozen upon exhalation and then stored for analysis later. Volatile components of EBC are typically analyzed via GC-mass spectrometry (MS) or similar chromatographic analysis to take advantage of both a pre-concentration step, as well as the enhanced selectivity of chromatographic separation [32][33][34][35][36]. The advantages and disadvantages of EBC as a matrix and various other analytical techniques used for this matrix have been reviewed previously [37]. Alternatively, direct breath sampling immediately transfers the breath to the inlet of the mass spectrometer for ionization and analysis. Thanks to significant improvements to modern high-resolution instrumentation, previous issues with sensitivity and selectivity have largely been overcome, and pre-concentration of breath samples is no longer required for good sensitivity. The three most popular techniques for direct analysis of breath are proton-transfer-reaction MS [38][39][40][41], selected ion flow tube MS [42][43][44], and secondary electrospray ionization-MS (SESI-MS) [45][46][47][48]. Of these techniques, SESI-MS is the most recently developed, but it has already proven effective at analyzing volatile and semi-volatile components of breath in real-time [49]. Additionally, this technique is easily interfaced with modern high-resolution mass spectrometry (HRMS), including Orbitrap mass analyzers, and commercial breath sampling devices to achieve unparalleled specificity in analyte identification [50,51]. Direct breath measurement represents the fastest and least invasive technique for analyzing breath and captures the best 'snapshot' of molecules contained within breath at the moment of exhalation.
As previously described, there are significant issues with breath sampling in regard to sample complexity and variability, as well as metabolite identification [7,8,26,52-54]. As outlined by Issitt et al, attempts to compare one study to another or to reproduce previous work are hampered by: variability between sampling methodologies, human phenotypic variability, complex interactions between compounds within breath, and confounding signals from comorbidities. Towards this end, we have endeavored to create a more generalized healthy human 'profile' against which follow-on studies might be compared. In building this profile, as few parameters as possible were controlled with respect to the components of breath from human volunteers, in order to replicate a more realistic 'real-world' sampling procedure and better characterize endogenous and exogenous features. We characterize 'healthy' in this study as participants who were able to freely give breath and do not have any underlying respiratory conditions, such as COVID-19, that might prevent them being able to give samples. Through the use of direct SESI-MS measurements utilizing HRMS and a custom bioinformatics pipeline, we seek to curate the average healthy human profile for breath and use this model to better understand changes in the profile associated with human metadata.

Study subjects
All collection protocols were approved for human subjects' research by the Human Research Review Committee (CR00008632). All IRB protocols were reviewed and approved by the University of New Mexico Health Sciences Office of Research, Human Research Protections Program prior to the enrollment of human subjects. IRB protocols were also approved by Department of Energy and Los Alamos National Laboratory prior to subject enrollment. In accordance with the Declaration of Helsinki, all human subjects provided written informed consent prior to enrollment.
For this study, 31 subjects (13 female, 18 male) were recruited from individuals aged 18-50 years old, and informed consent was obtained prior to enrollment. Subjects were given no specific instructions for fasting, but recent food and drink intake was recorded. Daily health questions were included within the subject questionnaire to protect subjects and study personnel and to help prevent the spread of COVID-19. The average age of the subjects was 33 ± 8 years for males and 35 ± 8 for females, with age ranges from 18 to 46 years. Of these, sixteen (16) subjects gave three breath samples at differing times of day (morning, afternoon, evening), and fifteen (15) gave single samples, each collected in triplicate for a total of 189 samples. Of the 189 samples that were collected, 78 were in the morning (∼8AM-11AM), 63 were in the afternoon (∼11AM-2PM), and 48 were in the evening (∼2PM-6PM). Attempts were made to keep the division of time point data as close to 1/3 (∼63 each) for each time of day.

Instrumental methods
Breath sampling and analysis for this study was accomplished by integrating a modified MA3100 ion mobility cell (Excellims, Acton, MA, USA) with an Exhalation CO2 exhalation interface (FIT, Madrid, Spain) mounted to the front end of an Orbitrap HRMS (Thermo Orbitrap Exploris 480, San Jose, CA) (figure 1). Subjects exhaled into a Spirometry Filter 83 spirometry filter (MADA Spirometry Filters, Milan, Italy) which was connected to a heated transfer line lined with 1/4″ PTFE tubing maintained at 150 °C with a digital temperature controller (AWL, Stoke on Trent, Staffordshire, UK). The heated transfer line was then interfaced with the ion mobility cell via 1/8″ PTFE tubing and introduced the breath sample directly to the ionization region of the SESI source within the cell. The ion mobility cell syringe pump was used with a 100 µl 1710RN gas-tight syringe (Hamilton, Franklin, MA, USA) to introduce 2 µl min−1 of 80:20 (v:v) Optima LC/MS Grade MeOH:H2O with 0.1% formic acid (Sigma Aldrich, St. Louis, MO, USA) as the spray solvent to ionize each breath sample.
Exhalations were maintained at a 7 l min−1 flow rate by the Exhalation CO2 flow controller to normalize the volume of each breath to approximately 1.0-1.2 l. The ion mobility cell was set to Open to allow for MS-only acquisition. Each breath sample consisted of a 2 min (120 s) untargeted full scan MS acquisition consisting of an initial 5 s pause followed by a 10 s exhalation and 10 s pause repeated six times (final pause of only 5 s). This resulted in six 10 s exhalations per sample, and each sample was acquired in triplicate. After the three replicates were collected, a final sample was collected in the same manner but utilizing a data-dependent acquisition (DDA) for MS/MS. Full scan MS spectra were acquired in positive mode (V = 2300) from m/z 60-700 at an instrument resolution of 120 000. DDA spectra were acquired with similar parameters. Blanks were also taken before and after the untargeted full scans and the DDA scan, using the same instrumental parameters as the other scans.
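The acquisition timing and per-breath volume described above can be checked with a short calculation. The following Python sketch is illustrative only (the function and variable names are assumptions, not instrument settings or part of the study's software); it rebuilds the pause/exhalation schedule and the nominal volume delivered at 7 l min−1 over a 10 s exhalation.

```python
# Illustrative check of the acquisition schedule; names are hypothetical.
EXHALATION_S = 10        # duration of each exhalation (s)
N_EXHALATIONS = 6        # exhalations per acquisition
FLOW_L_PER_MIN = 7.0     # flow rate held by the exhalation interface

def build_schedule():
    """Return (label, duration_s) segments for one 2 min acquisition."""
    segments = [("pause", 5)]  # initial 5 s pause
    for i in range(N_EXHALATIONS):
        segments.append(("exhale", EXHALATION_S))
        # 10 s pause between exhalations; the final pause is only 5 s
        is_last = i == N_EXHALATIONS - 1
        segments.append(("pause", 5 if is_last else 10))
    return segments

schedule = build_schedule()
total_s = sum(duration for _, duration in schedule)
volume_l = FLOW_L_PER_MIN * EXHALATION_S / 60.0  # litres per exhalation

print(f"total acquisition time: {total_s} s")
print(f"nominal volume per exhalation: {volume_l:.2f} l")
```

The schedule sums to the stated 120 s, and the nominal volume per exhalation falls within the stated 1.0-1.2 l range.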

Spectrum data processing
The instrument files collected from subjects were first processed using Compound Discoverer 3.3 [55]. Metadata used to distinguish each file included Subject Number, Time of Day (morning, afternoon, evening), and Date sample was taken. Files were searched against online databases including ChemSpider and mzCloud, as well as a local database (mzVault) with DDA files used for MS2 matches and mass tolerances set to 5 ppm. Results were exported as Excel files and used for further analysis.
Raw spectral files were converted to open source mzML format using msConvert [56]. Additional processing was performed in R version 4.2.2 [57]. Data as mzML was imported into R using the MSnbase package version 2.24.2 [58]. Smoothing was performed via the Savitzky-Golay method in the MSnbase package [59,60]. Preliminary peak picking was then performed using mean absolute deviation with m/z refinement via the k-neighbors method where k = 1 in the MSnbase package [61]. All scans were combined into a single representative spectrum using mean intensities via the MSnbase package. Representative spectra were denoised by removing spectra with intensity below 2 standard deviations in the spectrum. All m/z values were then rounded to the nearest 0.001 to fix resolution and respective peaks were binned using a sum of intensities. To further account for observed 'jitter' in compound m/z values across samples, local m/z centroids were calculated using local maximums of m/z differences and m/z values ±0.002 were binned with identified local centroids using intensity sums. Identified peaks seen in blank samples taken associated with each sample were then removed from sample spectra as background.
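The rounding, binning, jitter correction, and blank subtraction steps above can be sketched in a few lines. The study's processing was performed in R with MSnbase; the Python below is a simplified illustration under stated assumptions, not the authors' implementation. In particular, the jitter step here uses a greedy rule (the most intense bin acts as a centroid and absorbs neighbours within ±0.002), which stands in for the local-maximum centroid calculation described in the text, and all function names and peak values are made up for the example.

```python
from collections import defaultdict

def bin_peaks(peaks, decimals=3):
    """Round m/z to `decimals` places and sum intensities per bin.
    Keys are integers (m/z * 10**decimals) to avoid float comparisons."""
    scale = 10 ** decimals
    bins = defaultdict(float)
    for mz, intensity in peaks:
        bins[int(round(mz * scale))] += intensity
    return dict(bins)

def merge_jitter(bins, tol=2):
    """Assumed simplification: visit bins from most to least intense;
    each unvisited bin becomes a centroid and absorbs the intensity of
    remaining bins within +/- tol units (tol=2 means +/-0.002 m/z)."""
    remaining = dict(bins)
    merged = {}
    for key in sorted(bins, key=lambda k: bins[k], reverse=True):
        if key not in remaining:
            continue  # already absorbed by a stronger neighbour
        total = 0.0
        for k in range(key - tol, key + tol + 1):
            if k in remaining:
                total += remaining.pop(k)
        merged[key] = total
    return merged

def subtract_blank(sample, blank):
    """Remove sample features that also appear in the associated blank."""
    return {k: v for k, v in sample.items() if k not in blank}

# Toy (m/z, intensity) pairs, invented for illustration
sample_peaks = [(59.0491, 100.0), (59.0494, 50.0), (59.0502, 25.0),
                (70.0651, 80.0), (121.0509, 10.0)]
blank_peaks = [(121.0509, 12.0)]

sample = merge_jitter(bin_peaks(sample_peaks))
blank = merge_jitter(bin_peaks(blank_peaks))
cleaned = subtract_blank(sample, blank)
print({k / 1000: v for k, v in cleaned.items()})
```

Integer bin keys are used so that two peaks rounded to the same 0.001 always compare equal, which avoids floating-point mismatches when matching sample features against the blank.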

Data visualization and analysis
Data visualization and analysis was performed in R. Features were first ranked by the proportion of samples in which they were observed. Data was then standardized to relative abundance. Data visualization was performed using the ggplot2 version 3.4.1 and plotly version 4.10.1 packages in R [62,63]. MANOVA-type analyses PERMANOVA and ANOSIM were performed using the vegan package version 2.6.4 in R [64]. Non-metric multidimensional scaling (NMDS) was run using the vegan package in R. Ordination based models including Canonical Analysis of Principal Coordinates, EnvFit and ordisurf were all generated using the vegan package in R. Variance partitioning was performed using the vegan package in R. Optimal feature binning associated with metadata was performed using the opticut package version 0.1.2 in R [65]. Opticut has a three-tier ranking for strength of association; only the strongest tier of association ('+++') was considered in this analysis. Age of maximum metabolic shift was calculated using the TITAN2 package version 2.4.1 in R [66]. TITAN2 was designed to identify environmental thresholds by using indicator species scores to integrate occurrence, abundance, and directionality of taxa responses. Here, we have adapted TITAN2 to look at peak metabolic shift in breath by using age of participant as our gradient and m/z features instead of taxa.
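To make the first steps of this workflow concrete, the sketch below shows feature prevalence (the proportion of samples in which a feature is observed), standardization to relative abundance, and the Bray-Curtis dissimilarity used later for ordination. It is a plain-Python illustration with invented intensities and hypothetical function names; the study itself used R and the vegan package, and this sketch assumes every sample has a non-zero total intensity.

```python
def prevalence(matrix):
    """Proportion of samples (rows) in which each feature (column) is > 0."""
    n_samples = len(matrix)
    n_features = len(matrix[0])
    return [sum(1 for row in matrix if row[j] > 0) / n_samples
            for j in range(n_features)]

def relative_abundance(row):
    """Scale one sample so its intensities sum to 1 (assumes sum > 0)."""
    total = sum(row)
    return [value / total for value in row]

def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: sum|a_i - b_i| / sum(a_i + b_i)."""
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(x + y for x, y in zip(a, b))
    return numerator / denominator

# Toy intensity matrix: 3 samples x 3 features (values invented)
intensities = [[10.0, 0.0, 30.0],
               [5.0, 5.0, 10.0],
               [0.0, 0.0, 20.0]]

prev = prevalence(intensities)
# Feature indices ordered from most to least frequently observed
ranked = sorted(range(len(prev)), key=lambda j: prev[j], reverse=True)
rel = [relative_abundance(row) for row in intensities]
d01 = bray_curtis(rel[0], rel[1])
d12 = bray_curtis(rel[1], rel[2])
print(ranked, d01, d12)
```

Because each row of relative abundances sums to 1, the Bray-Curtis denominator is always 2 for standardized data, so the dissimilarity reduces to half the summed absolute difference between two samples.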

Untargeted analysis
It is a well-known issue that breath studies often suffer from a lack of reproducibility and in many cases, identification is limited to a Level 3 identification according to the standards proposed by the Chemical Analysis Working Group within the Metabolomics Standards Initiative (MSI) [67]. Because there is little standardization in both analytical techniques and reporting standards, results in the open literature are often difficult to compare. A recently updated meta-analysis of all reported biomarkers within breath and other biofluids found 1528 distinct compounds have been reported in breath at least once [7]. However, of these 1528 compounds, nearly three-fourths (1098, 71.9%) have only been reported within the literature once. Furthermore, of the 1098 compounds with only a single reference, a significant portion (430, 39.2%) originate from a single private communication rather than peer-reviewed literature. In order to help enhance reproducibility and confidence in the biomarkers found within breath, it is imperative that untargeted analyses be performed within a framework, such as that of the MSI.
In order to help better identify features for the bioinformatics pipeline, an initial, untargeted database search approach was taken to the raw data (figure 2, right, yellow). Because of the DDA runs that were collected with each breath sample, MS2 matches were also able to be made from the data to improve identification. Samples were delineated based upon Participant (1002, 1003, etc) and Time of Day (morning, afternoon, evening). Compound Discoverer produced 15 566 database hits which were then filtered based upon predetermined filter criteria (must be a named compound, possess an MS2, and have a peak rating ⩾4 and area-under-curve ⩾100 000). This filtering left 413 named compounds. After manually removing duplicates, background peaks, and obscure database hits, a final list of 48 unique compounds remained that were seen in at least 20% of samples and for which a reasonable, tentative Level 2 identification was possible; these were used for profile purposes. In a few select instances (e.g. C6H8), compounds for which no name but a chemical formula was available were included in our final list. These compounds were included based on the significant evidence for their presence in breath from peer-reviewed literature even when a structure cannot be determined with sufficient accuracy. The complete list of exact masses along with the proportion of samples they were seen in can be found within the supplemental section (supplemental table 1). Aside from using database searches for compound identification, raw spectral data were also concurrently processed using a custom bioinformatics pipeline (figure 2, left, blue). Raw instrument files are converted into mzML files wherein every feature is reduced to an m/z and intensity value.
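The predetermined filter criteria applied to the database hits (named compound, MS2 present, peak rating ⩾4, area-under-curve ⩾100 000) amount to a simple conjunction of tests. The Python sketch below illustrates that logic only; the field names (`name`, `has_ms2`, `peak_rating`, `area`) and the example records are hypothetical and do not reflect the actual Compound Discoverer export format or the study's data.

```python
MIN_PEAK_RATING = 4
MIN_AREA = 100_000

def passes_filters(hit):
    """True if a database hit meets all four stated criteria.
    `hit` is a dict with hypothetical keys chosen for this sketch."""
    return (bool(hit.get("name"))                       # must be a named compound
            and bool(hit.get("has_ms2", False))         # must possess an MS2 spectrum
            and hit.get("peak_rating", 0) >= MIN_PEAK_RATING
            and hit.get("area", 0) >= MIN_AREA)

# Invented example records, not results from the study
hits = [
    {"name": "acetone", "has_ms2": True, "peak_rating": 7.2, "area": 2.5e6},
    {"name": None, "has_ms2": True, "peak_rating": 8.0, "area": 9.0e5},
    {"name": "isoprene", "has_ms2": False, "peak_rating": 6.1, "area": 4.0e5},
    {"name": "1-pyrroline", "has_ms2": True, "peak_rating": 3.5, "area": 3.0e5},
    {"name": "ethanol", "has_ms2": True, "peak_rating": 5.0, "area": 5.0e4},
]

kept = [hit["name"] for hit in hits if passes_filters(hit)]
print(kept)
```

The manual curation that followed (removal of duplicates, background peaks, and obscure hits) involves expert judgment and is not captured by a rule-based filter of this kind.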
When managing data sets in software packages not specifically designed for MS data, considerations must be made for mass accuracy, denoising, normalization, etc. These issues are further complicated with larger cohorts, which may include millions of features collected from hundreds of HRMS scans. By including each of these processing steps, we have created a data pipeline that drastically reduces the time needed to process future experiments. These features have been used to investigate how the profile shifts based upon metadata, such as age and sex.

Statistical analysis
Breath samples were significantly different in both PERMANOVA analysis and ANOSIM analysis based on Time of Day, Participant ID, Date of Sample, Sex of Participant, and Age of Participant (p < 0.001) (supplemental table 2). Stochasticity associated with Date of Sampling is undesirable. In order to ensure that Participant ID findings were not biased by Date of Sample, we looked at covariation between these factors (supplemental figure 1). While some covariation was seen (18%), a meaningful proportion of variability in breath features was still able to be attributed to Participant ID alone. NMDS was performed in three dimensions using Bray-Curtis distance and resulted in a reasonable stress score of 0.108. Canonical Analysis of Principal Coordinates was significant for all tested variables (p < 0.001, p < 0.005) with a large proportion of breath features constrained for Date of Sample and Participant ID (42.9% and 33.5% respectively) and smaller constrained values for Time of Day, Sex, and Age (4.4%, 7.7%, and 2.3%, respectively) (supplemental table 3). EnvFit unconstrained models were significant for all variables except for sex. Again, a large proportion of variability was explained by Date of Sample and Participant ID (78.6% and 47.1% respectively), while a smaller proportion of variability was explained by Time of Day and Age (7.1% and 10.1%, respectively) (supplemental table 4). One common type of metadata that is often overlooked in terms of its effects on 'typical' human breath is age. An ordinal generalized additive model (GAM) was also fit to the NMDS ordination space for age as a continuous gradient (figure 3). The model was significant (p < 0.001) with 27.8% of variability in breath features explained by age. This GAM indicates the gradual transition of samples from lower-aged participants (lighter grey lines) to older participants (darker grey lines).
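For readers unfamiliar with PERMANOVA, the idea behind the test can be sketched as follows: a pseudo-F statistic is computed from a distance matrix by partitioning sums of squared distances among and within groups, and its significance is assessed by repeatedly shuffling the group labels. The Python below is a minimal one-factor illustration under stated assumptions (toy one-dimensional data with absolute differences as the distance, hypothetical function names, non-zero within-group variation); it is not the vegan implementation used in the study and omits features such as multiple factors and strata.

```python
import random

def pseudo_f(dist, groups):
    """One-way PERMANOVA pseudo-F from a square distance matrix."""
    n = len(groups)
    labels = sorted(set(groups))
    a = len(labels)
    # Total sum of squares: squared distances over all pairs, divided by N
    ss_total = sum(dist[i][j] ** 2
                   for i in range(n) for j in range(i + 1, n)) / n
    # Within-group sum of squares: same quantity computed per group
    ss_within = 0.0
    for label in labels:
        idx = [i for i in range(n) if groups[i] == label]
        ss_within += sum(dist[i][j] ** 2
                         for pos, i in enumerate(idx)
                         for j in idx[pos + 1:]) / len(idx)
    ss_among = ss_total - ss_within
    return (ss_among / (a - 1)) / (ss_within / (n - a))

def permanova(dist, groups, n_perm=999, seed=0):
    """Return (observed pseudo-F, permutation p-value)."""
    rng = random.Random(seed)
    f_obs = pseudo_f(dist, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute labels, keep distances fixed
        if pseudo_f(dist, shuffled) >= f_obs:
            hits += 1
    return f_obs, (hits + 1) / (n_perm + 1)

# Toy data: two well-separated groups of five samples (values invented)
points = [0.0, 0.1, 0.2, 0.3, 0.4, 5.0, 5.1, 5.2, 5.3, 5.4]
groups = ["A"] * 5 + ["B"] * 5
dist = [[abs(p - q) for q in points] for p in points]

f_obs, p_value = permanova(dist, groups)
print(f_obs, p_value)
```

With only ten samples the smallest attainable p-value is limited by the number of distinct label arrangements, which is one reason larger cohorts give such tests more resolving power.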
In addition to GAM modeling, we also plotted this age metadata using threshold indicator taxa analysis (TITAN), which indicated a significant shift in breath features across the age gradient of participants. TITAN analysis is used to show directionality of taxa responses and indicate the optimum value of a particular variable, in this case age. Peak disruption of breath metabolites occurred at approximately age 32 for this cohort (figure 4). While TITAN plots are usually used in a taxonomic sense, figure 4 clearly indicates a high degree of variability across the averaged 'profile' in regard to age. In addition, this analysis shows that most of the largest changes to breath occur before the age of 32, after which the degree of change slows. To date, there are no long-term studies of how human breath changes with age with which to compare this data. However, it is a well-known fact that human metabolism changes with age for a variety of metabolic systems [68][69][70]. Future studies may need to include characterizations of what compounds are 'normally' found within breath based upon age and explore how these age-related features shift over time.

Sex, time of day, and participant-based classifiers
Optimal binning analysis successfully identified strong associations between specific features and variables. These include 227 breath features identified as unique identifiers for 28 of the 31 participants (supplemental table 4). Three features were identified to be strongly associated with female participants and one with male participants (supplemental table 5), although the compounds strongly associated with females have no identifiable endogenous or exogenous source and therefore proved inconclusive. A total of 37 signals were identified to be strongly associated with the time of day (morning, afternoon, evening) in which samples were taken (supplemental table 6).
In terms of identifying subject phenotypes based upon mass spectral features and metadata, the single compound strongly correlated with males (1-pyrroline, CAS:5724-81-2) showed the most promise. It is well-known that spermine, spermidine, and putrescine are all present within the human respiratory and cardiovascular system [71,72]. An older study on specific anosmia to 1-pyrroline, a cyclic imine, indicated that the presence of spermidine, putrescine, and spermine alongside high concentrations of diamine oxidase in semen and male pubic area sweat could very likely produce appreciable quantities of 1-pyrroline via oxidation, and this compound was asserted to be a possible human sex pheromone [73]. Human olfactory trace amine-associated receptors are designed to detect low concentrations of small, volatile amines similar to 1-pyrroline such as isopentylamine, 2-phenethylamine, and cadaverine, in order to evoke particular social behaviors [74]. Although there is very limited evidence for sex-related differences in polyamine concentrations in blood, there is some evidence for increased concentrations of arginine and ornithine metabolites (spermidine, putrescine, spermine) within the breath of males. Nitric oxide, which is also produced via the arginine pathway, has been shown to be 59% higher in males than females [75]. Additionally, polyamine oxidase activity has been shown to be higher in several male rat tissues (thymus, spleen, kidney, liver) compared to female rats [76]. As more information becomes available linking real-time breath analysis to human metabolism, relationships between compounds seen within breath and their endogenous sources have the potential to become an excellent resource for clinical and forensic applications.

The human breath profile
Utilizing the database search results from the untargeted approach, we identified a total of 48 features that were seen in high frequency in samples and were identified with acceptable confidence based on previously mentioned filter criteria and reporting requirements (table 1). These same features are also visualized as a 'spectrum' analog in figure 5 with the y-axis representing proportion of samples the feature was observed in instead of the traditional intensity. It is worth noting that several of these features have m/z values above 300, which may prove difficult for typical GC/single quadrupole techniques to achieve in terms of mass range and volatility. Nevertheless, higher mass compounds represent an important component of breath and will help unlock future potential for real-time breath analysis in clinical settings [77]. It is also worth noting that several features within table 1 did not have an MS2 associated with them, which is a limitation of the DDA technique used here. Future HRMS analyses should include the new AcquireX data workflow, in which more comprehensive MS2 spectra with higher quality database hits can be collected which will enable better compound identification.
Each of the features in table 1 has a possible source listed. Many of the features listed within this table have numerous possible sources, such as dodecamethylcyclohexasiloxane. This compound is nearly ubiquitous, and can be found in polishes, waxes, and numerous cleaning and personal care products. Because these features persist after background subtraction, they are significant for the purposes of a human profile. Although fasting can attempt to eliminate certain features from food and drink being detected by breath analysis, completely removing exogenous features in breath originating from consumer products is unfeasible and does not represent the everyday human environment [78,79]. In addition to plasticizers and common contaminants prevalent in consumer products, there are many features present that can be traced back to food. These features, their human metabolites, and their bacterial degradation products are important factors to consider not only within a human 'profile' but also play an important role in differential individual breath identities. These features, as well as several common metabolites from active human metabolic pathways, make up the bulk of the volatile and semi-volatile compounds seen from this cohort of subjects.
Our findings suggest experimental design elements, particularly those associated with sampling, are extremely important to breath studies. While we were able to identify and quantify covariation associated with date of sampling due to some participants not being sampled across multiple days, additional experimental design elements accounting for this confounding factor can enhance future studies. For example, spreading sampling for participants across as many days as possible should help account for day-to-day stochasticity.

Conclusions
In this study we were able to identify breath features that represent a human 'profile' of compounds without enforcing dietary intake restrictions prior to sampling. Features were able to be identified within breath that correspond to consumer and cleaning products, food additives, and several metabolic pathways within the body. These features were able to be identified with a high degree of accuracy according to MSI guidelines and included some larger molecular weight compounds (>300 AMU) which are rarely seen in breath applications. We were additionally able to identify features that vary due to factors including individual identity, sex, and sampling time of day. These findings suggest the possibility of using individual breath 'fingerprints' for identification and diagnostic purposes in the future. However, this data is necessarily complicated by the presence of both endogenous and exogenous compounds whose origins cannot always be definitively identified; they are included here because of their important role in human 'exposomics' and their contribution to a healthy human breath profile. We were also, for the first time, able to observe and quantify through breath a metabolic shift associated with aging. Future work may be able to show additional important metabolic shifts in older or younger age ranges as well, which could contribute significantly to individualized medicine and understanding metabolic shifts in relation to age through breath analysis. Additionally, larger cohort sizes could help expand upon the profile data that has been shown here in order to improve modeling and feature identification.

Figure 1. Instrumental setup for real-time breath analysis with SESI-HRMS.

Figure 2. Data processing pipeline for breath samples. Samples were processed in parallel using an untargeted database search approach, as well as an agnostic data processing approach.

Figure 3. Generalized additive model (GAM) fit to non-metric multidimensional scaling (NMDS) ordination space showing age as a continuous gradient with respect to participant.

Figure 4. TITAN plot for age metadata. White dots represent increasing features while black dots indicate decreasing features. The size of the dot is representative of magnitude of shift while bars (dashed for white dots or solid for black dots) represent 95% confidence intervals for age of shift.

Figure 5. Healthy human 'profile' from 31 human subjects. The x-axis represents mass-to-charge of features selected, while the y-axis represents the proportion of the total samples in which the feature was found.

Table 1. Final list of 48 compounds contained within the averaged healthy human 'profile' with compound identification and the number of samples within which each compound was found. Each compound also has a possible source listed based on most likely exogenous and/or endogenous sources.