Profiling volatile organic compounds from human plasma using GC × GC-ToFMS

Volatile organic compounds (VOCs) originating from human metabolic activities can be detected in, for example, breath, urine, feces, and blood, and considerable attention has been given to identifying VOCs from these matrices. Studies identifying and measuring human blood VOCs are limited to those monitoring specific pollutants, or blood storage and/or decomposition. A comprehensive characterization of VOCs in human blood collected for routine diagnostic testing is lacking. In this pilot study, 72 blood-derived plasma samples were obtained from apparently healthy adult participants. VOCs were extracted from plasma using solid-phase microextraction and analyzed using comprehensive two-dimensional gas chromatography tandem time-of-flight mass spectrometry. Chromatographic data were aligned, and putative compound identities were assigned via spectral library comparison. All statistical analyses, including contaminant removal, data normalization, and transformation, were performed using R. We identified 401 features, which we called the pan volatilome of human plasma. Of the 401 features, 34 were present in all the samples with less than 15% variance (core molecules), 210 were present in ⩾10% but <100% of the samples (accessory molecules), and 157 were present in less than 10% of the samples (rare molecules). The core molecules, consisting of aliphatic, aromatic, and carbonyl compounds, were validated using 25 additional samples. The validation accuracy was 99.9%. Of the 34 core molecules, 2 (octan-2-one and 4-methylheptane) were identified in plasma samples for the first time. Overall, our pilot study establishes a methodology for profiling VOCs in human plasma and will serve as a resource for blood-derived VOCs that can complement future biomarker studies using different matrices with more heterogeneous cohorts.


Introduction
Many volatile organic compounds (VOCs) stemming from both exogenous [1–4] and endogenous [5–9] sources can be detected in human biological samples, including breath [3,10], saliva [11,12], urine [13,14], feces [15–17], and blood [18–20]. A compendium of VOCs identified from healthy human breath and body fluids has been cataloged by Drabińska et al [21]. Among these matrices, over 1400 VOCs have been identified in human breath. Although one of the sources of breath-derived VOCs is exchange with blood in the lungs, the number of studies cataloging VOCs from blood samples has been limited, yielding only 379 reported VOCs [21]. Current studies involving blood samples focus on monitoring pollutants [22], blood storage and aging [23], or targeted tests such as blood alcohol analysis [24]. Few studies (fewer than ten) have focused on identifying human blood-, plasma-, and/or serum-derived VOCs as biomarkers that can distinguish between healthy and disease groups. For example, in 1979, Zlatkis et al [25] demonstrated the possible use of volatile compounds from human serum in assessing susceptibility to viral infections. Goldberg et al [26] found increasing levels of 3-methylbutanal in blood plasma with increasing hepatic encephalopathic severity. Wang et al [27] showed that the blood metabolic profile of patients with colorectal carcinoma was distinct from that of healthy individuals, with three biomarkers (phenyl methylcarbamate, ethylhexanol, and 6-t-butyl-2,2,9,9-tetramethyl-3,5-decadien-7-yne) found at significantly lower levels and one biomarker (1,1,4,4-tetramethyl-2,5-dimethylene-cyclohexane) at higher levels in patients. Xue et al [19] identified hexanal, 1-octen-3-ol, and octane as promising blood-derived biomarkers for liver cancer, and Deng et al [20] identified hexanal and heptanal as potential biomarkers for lung cancer.
Likewise, Bhatt et al [28] conducted a pilot study and identified nine compounds (acetonitrile, acrylonitrile, carbon disulfide, isoprene, 1-heptene, 3-methylhexane, (E)-2-nonene, hydrogen sulfide, and triethylamine) from the headspace of plasma that could potentially aid diagnosis of esophageal adenocarcinoma. Baseline studies characterizing VOCs from blood are even more limited; they focus predominantly on comparing blood metabolites with metabolites from other matrices such as breath [2,18] or urine [29], and have used limited sample sizes (∼30 samples) [18,30,31].
In this paper we used plasma samples collected for routine blood work from 97 apparently healthy adults to profile the VOCs. We defined 401 features as the pan volatilome of human plasma and classified 34 as core molecules which were abundantly present in all the samples with low variance (<15% variance). The core molecules were validated and had 99.9% accuracy.

Study participants
Complete blood count (CBC) specimens were obtained from 72 participants between May 2015 and April 2018 at Dartmouth-Hitchcock Medical Center, Lebanon, NH. Excess blood from the CBC test was set aside for this study. Ethics approval for sample collection was obtained under study 00028773. Deidentified patient demographics and basic medical history were obtained, specifically age, body mass index, sex, current medications, smoking status, chronic disease history, and acute disease history (within the last six months). An additional 25 samples were obtained and used as a validation set. The sample details are provided in table 1 and supplementary table S1.

Sample acquisition and preparation
Blood samples, in 4 ml K2EDTA BD vacutainers, were transferred to the lab and then centrifuged in 15 ml conical tubes at 3000 rpm for 15 min using Allegra Benchtop 6R Centrifuges (Beckman Coulter, Indiana, United States) to separate out the plasma. Exactly 1 ml of the plasma was transferred into a 10 ml screw-capped headspace glass vial and stored at −20 °C prior to analysis.

Sample analysis and data alignment
Plasma samples were incubated for 10 min at 37 °C with agitation (250 rpm). Headspace molecules were concentrated using a 2 cm divinylbenzene/carboxen/polydimethylsiloxane solid-phase microextraction fiber (Supelco, Bellefonte, PA), which was suspended in the plasma headspace for 60 min at 37 °C with agitation (250 rpm). The compounds were separated and analyzed using two-dimensional gas chromatography tandem time-of-flight mass spectrometry (Pegasus 4D, LECO Corporation, St. Joseph, MI) equipped with a multi-purpose sampler (Gerstel, Linthicum Heights, MD). Sample introduction was performed in splitless mode with a 480 s desorption time at 270 °C. Other detailed instrumental parameters are provided in table 2.
The Statistical Compare feature of the ChromaTOF software (version 4.50, LECO Corporation) was used for processing and aligning the chromatographic data. For peak identification, the signal-to-noise (S/N) threshold was set at 50:1 in at least one chromatogram and a minimum of 20:1 in all other chromatograms. The resulting peaks were putatively identified using the National Institute of Standards and Technology (NIST) 2011 library. The first- and second-dimension peak widths were set at 10 s and 0.12 s, respectively. For the alignment of peaks across chromatograms, maximum first- and second-dimension retention time deviations were set at 6 s and 0.3 s, respectively, and the inter-chromatogram spectral match threshold was set at 600 (of 1000).
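The three alignment criteria above can be read as a simple matching predicate. The following Python sketch is illustrative only: the internal algorithm of ChromaTOF is not described here, a plain cosine similarity is used as a stand-in for its spectral match score, and the function names and the peak dictionary layout (`rt1`, `rt2`, `spectrum`) are hypothetical.

```python
import math

# Alignment tolerances quoted in the text.
MAX_RT1_DEV_S = 6.0       # first-dimension retention time deviation (s)
MAX_RT2_DEV_S = 0.3       # second-dimension retention time deviation (s)
MIN_SPECTRAL_MATCH = 600  # inter-chromatogram match threshold (of 1000)


def spectral_match(spec_a, spec_b):
    """Cosine similarity of two {m/z: intensity} spectra, scaled to 0-1000.

    Assumption: a plain cosine score stands in for the vendor's match score.
    """
    shared = set(spec_a) & set(spec_b)
    dot = sum(spec_a[mz] * spec_b[mz] for mz in shared)
    norm_a = math.sqrt(sum(v * v for v in spec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in spec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return 1000.0 * dot / (norm_a * norm_b)


def same_feature(peak_a, peak_b):
    """Return True if two peaks satisfy all three alignment criteria."""
    if abs(peak_a["rt1"] - peak_b["rt1"]) > MAX_RT1_DEV_S:
        return False
    if abs(peak_a["rt2"] - peak_b["rt2"]) > MAX_RT2_DEV_S:
        return False
    score = spectral_match(peak_a["spectrum"], peak_b["spectrum"])
    return score >= MIN_SPECTRAL_MATCH
```

Under this reading, two peaks are treated as the same feature only when both retention times and the spectrum agree, so co-eluting compounds with dissimilar spectra remain separate features.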

Data analysis
The overall workflow of the study, including data processing and statistical analysis, is summarized in supplementary figure S1. All statistical analyses were performed using R version 4.0.5 (R Foundation for Statistical Computing, Vienna, Austria). Before splitting the dataset into training (n = 72) and validation (n = 25) sets, contaminants were removed from the dataset. Contaminants were identified as compounds that separated poorly in the early part of the chromatogram because of chemical properties such as low boiling point [32] (e.g. ammonium, carbon dioxide, nitrogen, carbon disulfide, hydrogen sulfide, nitrous oxide, boron trifluoride, and nickel), which results in unreliable peak identification. Contaminants were also identified as compounds associated with coating materials of the gas chromatography (GC) column, plastics from experimental materials [33,34] (e.g. hydrogen cyanide, dimethylacetamide, oxalic acid and oxalic acid esters, and sarcosine), and chemical solvents or detergents used in the laboratory environment (e.g. isopropyl alcohol, methylene chloride, and trichloromethane) (supplementary table S2). After applying a further S/N cutoff of 500:1, the features (VOCs) in the training set were normalized using probabilistic quotient normalization [35] and log10-transformed. All the remaining features constituted the pan volatilome and were further classified into three subclasses based on their frequency of observation: core, accessory, and rare. Core molecules were those detected in all samples in the training set (n = 72), rare molecules were those detected in <10% of samples (n ⩽ 7), and accessory molecules were those detected in ⩾10% and <100% of samples (7 < n < 72). The normalized area for core molecules was obtained, and a variance of less than 15% across all samples was considered low [36].
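The normalization and frequency-based classification described above can be sketched as follows. This is a minimal Python illustration, not the R code used in the study. It assumes a samples × features table of peak areas in which 0 marks an undetected feature, uses the per-feature median as the reference profile for probabilistic quotient normalization, and omits the optional initial total-area scaling; all function names are hypothetical.

```python
import math
from statistics import median


def pqn_normalize(samples):
    """Probabilistic quotient normalization of a samples x features table.

    `samples` is a list of equal-length lists of peak areas; 0 means the
    feature was not detected and is ignored when computing quotients.
    """
    n_features = len(samples[0])
    # Reference profile: per-feature median of the detected (non-zero) areas.
    reference = []
    for j in range(n_features):
        detected = [row[j] for row in samples if row[j] > 0]
        reference.append(median(detected) if detected else 0.0)
    normalized = []
    for row in samples:
        quotients = [row[j] / reference[j]
                     for j in range(n_features)
                     if row[j] > 0 and reference[j] > 0]
        # The median quotient is the estimated dilution factor of the sample.
        factor = median(quotients) if quotients else 1.0
        normalized.append([area / factor for area in row])
    return normalized


def log10_transform(samples):
    """log10 of each detected area; undetected features become None."""
    return [[math.log10(a) if a > 0 else None for a in row] for row in samples]


def classify_by_frequency(samples):
    """Label each feature core, accessory, or rare by detection frequency."""
    n_samples = len(samples)
    labels = []
    for j in range(len(samples[0])):
        n_detected = sum(1 for row in samples if row[j] > 0)
        if n_detected == n_samples:
            labels.append("core")
        elif n_detected * 10 < n_samples:  # detected in fewer than 10%
            labels.append("rare")
        else:
            labels.append("accessory")
    return labels
```

With 72 training samples, the integer comparison `n_detected * 10 < n_samples` labels a feature seen in 7 samples as rare and one seen in 8 samples as accessory, matching the cutoffs stated in the text.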
Accumulation and rarefaction curves were generated for the pan and core volatilome, following the well-established methodology used for the analysis of genomic data [37,38]. Pan volatilome compounds were added to the accumulation curve and core molecules were added to the rarefaction curve to assess the adequacy of sampling. A subset of data was selected at random without replacement and a total of 500 iterations were performed to generate the curves. The occurrence of each volatile feature was indicated in the binary scale for each sample (0 indicated absence and 1 indicated presence).
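The resampling procedure can be sketched as below. This Python illustration assumes the binary presence table described above and averages, over random orderings of the samples, the number of features seen in any of the first k samples (pan, accumulation) and in all of the first k samples (core, rarefaction). The function name, the seed parameter, and the choice to average the iterations are assumptions.

```python
import random


def volatilome_curves(presence, n_iter=500, seed=0):
    """Mean pan (union) and core (intersection) sizes versus samples drawn.

    `presence` is a samples x features table of 0/1 flags. Each iteration
    draws the samples in a random order without replacement.
    """
    rng = random.Random(seed)
    n_samples = len(presence)
    feature_sets = [{j for j, flag in enumerate(row) if flag}
                    for row in presence]
    pan_totals = [0] * n_samples
    core_totals = [0] * n_samples
    for _ in range(n_iter):
        order = rng.sample(range(n_samples), n_samples)
        pan = set()
        core = None
        for k, idx in enumerate(order):
            pan |= feature_sets[idx]
            if core is None:
                core = set(feature_sets[idx])
            else:
                core &= feature_sets[idx]
            pan_totals[k] += len(pan)
            core_totals[k] += len(core)
    pan_curve = [total / n_iter for total in pan_totals]
    core_curve = [total / n_iter for total in core_totals]
    return pan_curve, core_curve
```

A pan curve that flattens as k approaches the number of samples, together with a core curve that stops decreasing, is the pattern that would suggest sampling is adequate.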
Next, the proportion of observed peaks of core molecules was compared between the training and validation datasets. To evaluate the presence of the core molecules in the validation dataset, the S/N cutoff was set to 50:1. Because of the unequal sample sizes of the training and validation datasets, we proposed the following validation metric:
\[ \text{Validation accuracy} = \left(1 - \frac{n_{\text{miss}}}{a \times n_v}\right) \times 100\% \]
where $a$ is the number of core molecules, $n_v$ is the validation sample size, and $n_{\text{miss}}$ is the number of samples in which the core molecules were not observed.
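As a worked check, the sketch below assumes the metric has the form 100 × (1 − n_miss / (a × n_v)), which reproduces the reported 99.9% when 34 core molecules are checked in 25 validation samples with a single missed observation. The function name is hypothetical and the formula is an assumption inferred from the definitions of a, n_v, and n_miss.

```python
def validation_accuracy(n_core, n_validation, n_missing):
    """Percentage of core-molecule observations found in the validation set.

    Assumed form: 100 * (1 - n_missing / (n_core * n_validation)).
    """
    total = n_core * n_validation
    if total <= 0:
        raise ValueError("n_core and n_validation must be positive")
    return 100.0 * (1.0 - n_missing / total)


# 34 core molecules, 25 validation samples, one missed observation.
accuracy = validation_accuracy(n_core=34, n_validation=25, n_missing=1)
print(f"{accuracy:.1f}%")  # prints 99.9%
```

Dividing by a × n_v rather than by the number of samples alone makes the metric a per-observation rate, so it is not inflated or deflated by the smaller size of the validation set.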

Reporting of core VOCs
The mass spectra of each core molecule (identified from 72 training samples) were visually inspected to verify that they were well matched across aligned samples. Specifically, the m/z and relative intensities of the four most abundant peaks of a given core feature were compared across training samples as well as with the NIST 2011 library. According to the Metabolomics Standards Initiative criteria [39], the core features identified in this study belonged to level 2, 3, or 4 of metabolite identification. They were further classified into four classes, as listed below, based on (1) the top ten hits from comparison to the NIST 2011 library, and (2) the forward similarity (spectral match score) of each hit. This classification was used to express varying degrees of confidence in the putative identifications assigned to features. Specifically, Class A (highest confidence): a putative name was assigned to any feature that had at least four identical hits with a mass spectral match score of ⩾850 (of 1000). Class B: a putative formula was assigned to any feature that had at least five hits sharing the same formula with a mass spectral match score of ⩾700. Class C: a putative class was assigned to any feature with a mass spectral match score of ⩾600 whose top five hits belonged to the same compound class. Class D (lowest confidence): features were classified as 'Unknown' if they did not satisfy any of the above criteria.
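The four confidence classes can be expressed as a small decision procedure. The Python sketch below is one possible reading of the criteria, not the procedure actually used: it assumes that 'identical hits' means library hits sharing a compound name, that the Class C score threshold applies to the best hit, and that each hit is a dictionary with the hypothetical keys `name`, `formula`, `compound_class`, and `score`.

```python
from collections import Counter


def assign_confidence_class(hits):
    """Return (class letter, assigned label) for a feature's library hits.

    Only the ten best-scoring hits are considered.
    """
    top = sorted(hits, key=lambda h: h["score"], reverse=True)[:10]

    # Class A: at least four hits with the same name, each scoring >= 850.
    names = Counter(h["name"] for h in top if h["score"] >= 850)
    if names and names.most_common(1)[0][1] >= 4:
        return "A", names.most_common(1)[0][0]

    # Class B: at least five hits with the same formula, each scoring >= 700.
    formulas = Counter(h["formula"] for h in top if h["score"] >= 700)
    if formulas and formulas.most_common(1)[0][1] >= 5:
        return "B", formulas.most_common(1)[0][0]

    # Class C: best score >= 600 and the top five hits share one class.
    if len(top) >= 5 and top[0]["score"] >= 600:
        top_five_classes = {h["compound_class"] for h in top[:5]}
        if len(top_five_classes) == 1:
            return "C", top[0]["compound_class"]

    # Class D: none of the criteria above were met.
    return "D", "Unknown"
```

The checks run from the most to the least specific label, so a feature that satisfies several criteria receives the highest-confidence class it qualifies for.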
In order to assess the adequacy of sampling for our study, we conducted an accumulation (figure 2(a)) and rarefaction (figure 2(b)) curve analysis for the pan volatilome and core volatilome respectively. Both curves approached an asymptote (where the variance in volatilome size became small and unchanging with the addition of samples), suggesting that the 72 samples in the training set sufficiently captured the full variation of pan and core volatilomes.

Classifying and validating the core volatilome
We classified [39] the 34 core molecules into four classes (explained in the methods section). Seven molecules were assigned a putative name (table 3), 16 molecules were given a putative formula (supplementary table S3) and 8 molecules were given a putative class (supplementary table S4). Three molecules were classified as unknown (supplementary table S4). We calculated the normalized area variance of every core molecule. Thirty out of 34 core molecules had a variance of less than 5% across all samples in the training set and the remaining 4 molecules had a variance between 6% and 12%, suggesting that these core molecules were consistently present in all samples with low variance, irrespective of sex or health status.
A validation check of the 34 core molecules revealed that all but one were present in all 25 samples. The remaining molecule, a cyclic ether with a putative formula of C4H8O, was detected in 24 samples and was missing in only 1. Thus, our validation accuracy was 99.9%.

Discussion
The goal of our study was to define and characterize the VOCs present in the headspace of human plasma samples. We identified 34 core molecules which were present in all 72 training samples and validated with 99.9% accuracy in the validation set consisting of 25 samples. The core molecules belonged to different classes, the predominant being aliphatic hydrocarbons. The study participants included male and female adults (supplementary table S1) with underlying hypertension (n = 27), diabetes (n = 11) and also included smokers (n = 8) and patients with history of smoking (n = 41). Therefore, the core molecules identified from this study can be considered as being present in all conditions investigated in this population, irrespective of sex and health status which sets the basis for a baseline study.
Of the 34 core molecules, only 7 were assigned a putative name, and we restrict our discussion to these 7 molecules. Four of these seven molecules (butan-2-one, benzene, styrene, and toluene) have already been identified in human blood samples [2,40,41]. Exogenous sources such as pollution [42], smoking [43], and working environment [44] are likely contributors to their presence in blood. However, higher concentrations of butan-2-one were reported in human blood and urine samples collected from patients with liver cancer [45] and breast cancer [14], respectively. Similarly, benzene [46], toluene [47], and styrene [48] have been classified as potential carcinogens, indicating that these molecules may be closely associated with disease conditions as well. Therefore, a baseline study confirming the presence of these molecules in healthy human blood is critical for monitoring the concentration of these compounds in disease conditions.
One of the seven core molecules with an assigned name (2,4,4-trimethylpent-1-ene) was identified in aged and dried human blood samples, sampled periodically over 12 months from a non-porous (aluminum) surface [49]. Although this molecule has not been reported from blood samples in a disease setting, it has been identified in breath samples in a lung cancer study [50]. In this study by Rudnicka et al, breath samples were collected from non-smokers and smokers who were healthy and/or had lung cancer. Although the molecule did not show a significant difference between healthy and disease groups, it was identified in all the samples, irrespective of disease status or smoking status. This is in line with the
Table 3. The seven core molecules were assigned putative names based on the Class A criteria described in the methods. First and second retention times are the mean primary and secondary retention times across all 72 training samples for that molecule. The forward similarity score shown here is that of the highest matching library hit.
Two core molecules with putative names, octan-2-one and 4-methylheptane, have never been identified in blood samples. Octan-2-one is a synthetic flavoring substance used as a food additive [51]. It has been identified in exhaled breath samples collected from baboons with cardiometabolic dysfunction [52]. It has also been identified in fecal samples obtained from patients with Crohn's disease, patients with ulcerative colitis, and healthy controls [53]. The octan-2-one level was significantly different between healthy controls and patients with active Crohn's disease. Although the source of this molecule in human plasma is unknown, we identified octan-2-one as one of the core molecules present in all the samples.
Likewise, 4-methylheptane has never been reported in human blood but has been identified in breath samples of children diagnosed with type 1 diabetes [54], although no association was found between blood glucose levels and 4-methylheptane. It has also been identified in breath samples collected from individuals who smoke [43] and was detected at higher levels in lung tumor tissues than in healthy tissues [55]. In addition to these matrices, 4-methylheptane has been identified in urine samples of both healthy children and children with autism, with a frequency of occurrence of 65% and 92%, respectively [56]. Both octan-2-one and 4-methylheptane were detected in 100% of samples with less than 5% variance in our study. Their high abundance in plasma samples, along with their association with various diseases (based on the aforementioned studies), warrants further investigation of these molecules in blood samples collected from disease conditions.
One of the limitations of this study is the lack of more heterogeneous cohorts to establish a baseline signature of metabolites indicating a health status. The inclusion of samples from different ethnic backgrounds, countries, economic statuses, and occupations would help in establishing a universal baseline signature in human plasma. Another limitation of this pilot study is the absence of reference standards to confirm the identity of the core and pan volatilome quantitatively. This necessitated the use of more stringent criteria for identification of molecules. For example, as mentioned in the methods, the similarity threshold (spectral match score) was raised from 600 to 850 for at least four identical hits in order to assign a putative name to a molecule. Therefore, apart from the seven core molecules, the other core molecules could not be named with confidence, even though some of the molecules with a putative formula had a forward similarity of up to 960 (maximum is 1000). However, the core molecules with putative names could be used in future quantitative studies, while the molecules with a putative class or formula would help in future qualitative studies, as the information they provide (e.g. molecular mass, number of carbons, possible functional groups, and location in GC × GC space) will guide the selection of appropriate chemical standards for higher-level identity confirmation. Future studies may be supported by the inclusion of reference standards, such as alkane mixtures for assessing retention indices, as a supporting method for identifying more peaks, especially the accessory and rare peaks, since mass spectral matching alone was not sufficient to identify some of the molecules, especially structural isomers [57]. The third limitation of this study was the use of a very high signal-to-noise threshold (500:1) during data processing.
Using such a high threshold excluded molecules with lower intensity values, which were therefore not included in the compendium of core molecules. However, this threshold ensured that the core molecules were abundant in every sample, thereby fulfilling the purpose of characterizing blood VOCs. Further increases in the threshold did not alter the number of core molecules, and we therefore fixed this threshold as the optimal cutoff for this dataset.

Conclusions
Our pilot study establishes the methodology and provides a basis for cataloging the VOCs present in human plasma samples. The study serves as a resource for blood-derived VOCs that can complement future biomarker studies on healthy subjects as well as those with a disease. Future studies should be larger, involve a more heterogeneous population, and quantify VOCs.

Data availability statement
The data that support the findings of this study are available upon reasonable request from the authors.

Ethical statement
The study was reviewed and approved by Dartmouth College and Dartmouth-Hitchcock Medical Center (Study00028773). This research was conducted in accordance with the principles embodied in the Declaration of Helsinki and in accordance with local statutory requirements. The sample acquisition in this research does not interfere with patient care, and the collected samples do not compose a human genetic biobank and will never be used to ascertain human genetic information. Thus, the collected data has no potential to damage financial standing, employability, insurability, or reputation. Given the nature of the samples used for this study, a waiver of informed consent was obtained.