Concurrent analysis of electronic and ionic nanopore signals: blockade mean and height

Electronic and ionic current signals detected concurrently by 2D molybdenum disulfide nanopores are analyzed in view of detecting (bio)molecules electrophoretically driven through these nanopores. The passage of the molecules, giving rise to translocation events in the nanopores, can be assigned to specific drops in the current signals, the blockades. Such blockades are observed in both the electronic and the ionic signals. In this work, we analyze both signals separately and together by choosing specific features and applying both unsupervised and supervised learning. Two blockade features, the height and the mean, are found to strongly influence the clustering and the classification of the nanopore data, respectively. At the same time, the concurrent learning of both the electronic and ionic signatures enhances the predictability of the learning models, i.e. the nanopore read-out efficiency. The interpretation of these findings provides an intuitive understanding for optimizing the read-out schemes and enhancing the accuracy of nanopore sequencers in view of error-free biomolecular sensing.


Introduction
Solid-state nanopores are nanometer-sized holes in materials that can electrophoretically thread, i.e. translocate, charged species, biomolecules, and polymers [1][2][3]. They are being developed in view of biosensing, read-out, and sequencing of (bio)molecules [4,5], a concept that is based on ionic current measurements [6]. The electrophoretic passage of the molecules through the pores can be detected through changes in these current traces, the current blockades. An additional detection approach involves the use of electrodes embedded in the nanopores that can be used to measure an electronic current across the nanopores [7,8]. This scheme gives rise to detectable electronic current signals, which include electronic current blockades that are also related to translocation events [9]. Recent experiments using two-dimensional MoS2 nanopores [9,10] could concurrently measure both the ionic and electronic signals [11,12]. In these measurements, the translocation events are detected in the blockades of both measurement channels, the ionic and the electronic, and are correlated [13]. The electronic measurements can be realized using embedded electrodes in the device [14][15][16]. This bimodal detection inherently involves more measurements, providing richer data sets that can be used for analyzing the experiments and detecting the identity of the threaded biomolecules with a higher efficiency, i.e. a lower error.
In order to realize this and allow for predictions on the identity of molecules in new experiments, the nanopore signals need to be efficiently processed and analyzed. For such purposes, a number of read-out and base-calling algorithms based on Machine Learning (ML) schemes have been proposed [17][18][19][20]. In these cases, strategies for the transformation of raw sequencing data into base calls [21] have been developed. More generally, the integration of nanoscale sensing with ML has been developing rapidly, for example for the detection of viruses [22]. In the case of nanopores, numerous recent proposals on novel read-out schemes based on ML involve, among others, the use of hidden Markov models [23], neural networks [24] or recurrent neural networks [25] for the Oxford Nanopore MinION nanopore device, convolutional neural networks [26,27], and deep neural networks for error correction in the base calling [28]. Recently, an integration of supervised ML and transverse quantum transport through a graphene nanopore was able to classify unlabeled nucleotides from their transmission readouts [29]. In most of these studies, the aim was to develop a read-out protocol or a pipeline that reduces the prediction error, thus enhancing the sequencing efficiency. However, these involve either ionic or electronic current data from nanopores and do not provide a detailed analysis of the inherent structure in the blockades and the importance of the features therein. Accordingly, an in-depth understanding of the exact information hidden in the blockade structure and the feature space, which could provide a more intuitive approach to the development of read-out algorithms, is missing.
In this work, we provide, to our knowledge for the first time, a framework that jointly processes and analyzes concurrent ionic and transverse currents from nanopore experiments in order to understand the importance of these measurements and their inherent details mapped on the choice of features. In this way, the path is opened for a guided and physically intuitive base-calling development that can efficiently learn, read out, and predict the identity of (bio)molecules electrophoretically threading nanopores. Our approach is applied, tested, and evaluated in the case of 2D nanopores threading biomolecules. Along these lines, the paper is organized as follows: we begin with an outline of the proposed methodology and present the available experimental ionic and electronic current time traces from 2D nanopore experiments in section 2. In section 3, chosen features are extracted from all available data and are analyzed through unsupervised clustering techniques. In a next step, selective and distinctive features of these data are processed through clustering algorithms, followed by the application of classification schemes that assess the feature hierarchies in a potential learning and read-out process. In the end, we discuss the relevance and applicability of our work towards nanopore technologies and analyte detection.

Methodology
We analyze data from experiments measuring both the electronic and the ionic current through a nanopore. To this end, we pre-process the data in order to extract their most relevant features and then use these features to interpret and assess the signals.

Experimental data
For the analysis, experimental data from previous nanopore experiments [13] are used. The data include current measurements from 2D MoS2 nanopores that thread different bio/molecules. The experimental measurements involve both electronic and ionic current time traces (correlated events from [13]). These signals are measured at the same time and can be used to detect the passage of different analytes based on the current blockades this leads to. These blockades refer to the translocation events that are further analyzed in this work and 'carry' with them the identity of the translocating bio/molecule. Data from three different experiments with biopolymers electrophoretically driven through the MoS2 nanopores are analyzed. The biomolecules are a one thousand base-pairs double-stranded DNA (1 kb dsDNA), a negatively charged 80 nucleotides-long single-stranded DNA molecule (80 nt ssDNA), and a positively charged poly-D-lysine hydrobromide (polylysine) with an average molecular weight (mw) of 30 × 10^3 to 70 × 10^3 g/mol. In the following we will use the labels 'ssDNA', 'dsDNA', and 'polyLys' for these experiments and analytes, respectively. The translocation events were detected and concatenated using the EPFL software OpenNanopore [30]. A short summary of the experimental conditions and the biomolecules used in this work is given in table 1. Note that the two chambers of the experiments, the cis, in which the biomolecules are placed before translocation, and the trans, which is the chamber the biomolecules enter after their translocation through the pore, are filled with the same salt at different concentrations in order to assist the electrophoretic motion of all species (anions/cations and biomolecules) through the pore. For the same purpose, in the polyLys experiment two different ionic voltages were applied. All experimental details are given in the original publication on the nanopore experiments [13].

Table 1. Overview of the most relevant experimental details and conditions related to the analyzed data. 'Analyte', 'pore', 'salt', V_ionic, V_el, and 'DA' refer to the translocating molecule, the pore diameter, the salt solution, the voltage difference in the ionic and electronic channel, respectively, and the presence of a differential amplifier.

Detection of translocation events and feature extraction
For each experiment, the different translocation events are present in the raw current signals. These were detected using the OpenNanopore software [30]. This software detects the differences in the current time traces and the respective blockades denoting translocation events of the biopolymers. As the measurements are prone to different sources of noise, the latter can be observed in the current signatures. In order to reduce the noise and be able to clearly detect the translocation events, a filter implemented in the software was used. Higher values of the related filter coefficient (a) will detect fewer events, as expected, while lower values would interpret a reasonable fluctuation in the current as an event. On top of this filtering, the start (S) and end (E) thresholds that define an event have to be set accordingly by matching the events in both electronic and ionic channels. For this, different values of these thresholds were used in the ionic and in the transverse (electronic) signals. The values for the different experiments were found by visual inspection of each time trace and the correlation of both measurement channels (electronic and ionic) and are summarized in table 2. For the event detection, a maximum time limit for events of 0.1 s, a delay limit between correlated events of 0.005 s, and a minimum time limit for events of 0.000001 s were used. Note that 'correlation' refers to the observation of translocation events within the same time frame for both measurement channels, electronic and ionic. The pre-processing of the data using the parameters listed in table 2 led to the detection of current blockades linked to translocation events, which will be analyzed in the following in view of an efficient detection of the analyte type. We could detect 700, 100, and 417 events for the ssDNA, dsDNA, and polyLys analytes, respectively. In figures 1(a,b), we show a snapshot of the original time series of the concatenated data, i.e. current blockades in both ionic and electronic channels including around 15 translocation events. These clearly show correlated translocation events in the two channels and are representative of all events analyzed below. This figure also focuses on one of the many events in the measurements, in both the (c) ionic and (d) electronic channels as denoted in the figure, in order to visualize the processing of the events and the feature extraction discussed below. A broader spectrum of these events can be found in the respective experimental work [13], while a view on many similar ionic events has been given in one of our similar studies [31]. In panel (e) of the same figure, the two different measurements in the pore environment are sketched and their directions, also pointing to their different nature, are given. The depicted ionic and electronic blockades are correlated in time, as clearly seen from the time axes, and represent one translocation event, i.e. the passage of a single biomolecule through the pore. The features that were extracted from the data for both channels, ionic and electronic, using the CUSUM algorithm are (a) the number of distinct current values found in a blockade ('lvl'), (b) the current height, defined as the difference between the highest and the lowest levels found in a blockade ('height'), (c) the mean current value of a blockade ('mean'), (d) the dwell time, that is the duration of each translocation event ('dwell'), and (e) the current drop, defined as the difference between the current baseline when no translocation takes place and the minimum current of the event ('drop'). In parentheses, the notation of the respective features as used in the following is given. In this notation, e.g. ionic height ('height_i') and electronic mean ('mean_el') correspond to the height in the ionic channel and the mean in the electronic channel, respectively.
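The correlation criterion described above (events observed within the same time frame in both channels, with a delay limit of 0.005 s and duration limits of 0.000001 s and 0.1 s) can be sketched as follows. This is a minimal illustration and not the OpenNanopore implementation: the function name, the representation of an event as a (start, end) tuple in seconds, and the choice of comparing event start times are assumptions made here.

```python
from typing import List, Tuple

Event = Tuple[float, float]  # (start time, end time), both in seconds


def pair_correlated_events(
    ionic: List[Event],
    electronic: List[Event],
    max_delay: float = 0.005,   # delay limit between correlated events (s)
    min_dur: float = 1e-6,      # minimum event duration (s)
    max_dur: float = 0.1,       # maximum event duration (s)
) -> List[Tuple[Event, Event]]:
    """Pair ionic and electronic events whose start times differ by at most max_delay.

    Assumption: 'correlated' is taken to mean 'start times within max_delay';
    events outside the duration limits are discarded before pairing.
    """
    def valid(ev: Event) -> bool:
        return min_dur <= ev[1] - ev[0] <= max_dur

    ionic_sorted = sorted(ev for ev in ionic if valid(ev))
    electronic_sorted = sorted(ev for ev in electronic if valid(ev))

    pairs = []
    j = 0
    for ev_i in ionic_sorted:
        # Skip electronic events that start too early to match this ionic event.
        while j < len(electronic_sorted) and electronic_sorted[j][0] < ev_i[0] - max_delay:
            j += 1
        if j < len(electronic_sorted) and abs(electronic_sorted[j][0] - ev_i[0]) <= max_delay:
            pairs.append((ev_i, electronic_sorted[j]))
            j += 1  # each electronic event is used at most once
    return pairs
```

Events that find no partner in the other channel are simply left out of the returned list, which mirrors the restriction of the analysis to correlated events.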
The types of features extracted, based on the definitions above, are also indicated in panels (c) and (d) of figure 1 for one typical event in the data. Note that the analysis described was performed on all three data sets and was consistent. In the following discussion, we graphically represent findings from the ssDNA set, as this was the richest and most informative dataset. The analysis followed is given as a simple workflow in figure 1(f). Starting with the raw nanopore data and the translocation events therein, we process these in order to extract the features. These are in turn evaluated based on clustering and classification algorithms. The latter includes an analysis of the feature hierarchy, i.e. the influence of each feature on the classification and learning. The details and outcome of this workflow are clarified step by step in the following, with the aim of assessing the features that need to be included in a nanopore read-out protocol. In order to obtain more insight into the feature influence in both clustering and classifying/learning, we have applied the SHAP (SHapley Additive exPlanations) analysis [32] on the features extracted from the nanopore data. Briefly, features assigned higher SHAP values are the most influential ones towards learning. In this context, we have used three different classifiers: (a) a convolutional neural network (CNN) [33,34], (b) an XGBoost classifier (XGB) [35], and (c) a Random Forest Classifier (RFC) [36]. Overall, we could obtain an accuracy above 99.52%, as indicated by the performance of all classifiers for the three datasets in table 3.
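The five feature definitions given in the text can be written down compactly. The sketch below is only an illustration under stated assumptions: the function name and input layout are hypothetical, the level-fitted (CUSUM-processed) trace is assumed to be piecewise constant so that distinct levels can be counted by exact equality, and, following the caption of figure 1, the mean and drop are taken from the raw signal while the height and level count are taken from the processed one.

```python
from statistics import mean
from typing import Dict, Sequence


def blockade_features(
    raw: Sequence[float],      # raw current samples inside one blockade
    fitted: Sequence[float],   # level-fitted samples of the same blockade
    baseline: float,           # open-pore current outside the event
    dt: float,                 # sampling interval in seconds
) -> Dict[str, float]:
    """Compute 'lvl', 'height', 'mean', 'dwell', and 'drop' for one event."""
    if not raw or not fitted:
        raise ValueError("an event needs at least one sample")
    levels = sorted(set(fitted))  # distinct current levels in the blockade
    return {
        "lvl": len(levels),                 # number of distinct current values
        "height": levels[-1] - levels[0],   # highest minus lowest level
        "mean": mean(raw),                  # mean current of the blockade
        "dwell": len(raw) * dt,             # duration of the event
        "drop": baseline - min(raw),        # baseline minus minimum current
    }
```

Applying the same function to the ionic and to the electronic trace of a correlated event yields the per-channel features, e.g. 'height_i' and 'mean_el'.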

Clustering: current blockade height
Table 3. Precision (%) of the three classifiers for the three datasets. For each model, the left, middle, and right columns refer to the ionic (i), electronic (el), and both channels (i-el), respectively.

We begin with a pairwise clustering of the features of the type ftr_i-ftr_j, where ftr = lvl, height, mean, dwell, drop and i, j = i, el for ionic and electronic, respectively (see supplementary information for details). Note that during the detection of the translocation events for both the ionic and electronic channels we observed a very good correlation in the events of both channels, which also show the same dwell time. In fact, we have considered all such correlated events in both measurements in the following analysis, discarding the few non-correlated events. For the individual channels, either ionic or electronic, two clearly distinguishable clusters were observed in the feature space combining the blockade height with the mean. This is supported by figures 2(a,b) for the richest dataset, ssDNA. This could indicate the presence of two distinct (most probable) molecular configurations. Especially in the case of ssDNA, for which the pore size (5.9 nm) is larger than the molecule's diameter (roughly 1 nm), the different clusters could be assigned to single-file and folded translocation events. In the latter, the molecule translocates the pore in a folded conformation, blocking more of the pore and thus leading to deeper blockades. The combination of the ionic and electronic channels (shown in figures 2(c), (d) for ssDNA) reveals the importance of the electronic signal in better defining clusters, that is, in clarifying the importance of certain features. This is clearly indicated when comparing panels (b) and (d) in figure 2, as in the latter the clusters are more clearly separated. Using the height in these panels is more efficient than using the mean as in figures 2(a, b), for which the effect of adding the electronic channel is not as pronounced. Our results and observations clearly underline the higher quality of the cluster formation, that is, more distinct and well-separated clusters, in the bi-modal ionic-electronic feature space over the respective one-mode spaces, especially when using the blockade height as a feature. This points to a higher relevance of the height over the mean in defining clusters of distinct translocation events (folding, conformational changes, etc.). Accordingly, both the ionic and electronic current blockades include signatures of such distinct events in the respective height values. As a result, analyzing both channels separately can identify clusters. However, clearly distinguishing those clusters and extracting more information is done most efficiently through the concurrent analysis of both channels (ionic and electronic) when using the feature height.
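To illustrate how two clusters can be located in a normalized two-dimensional feature space such as height_i versus height_el, a minimal two-cluster k-means sketch is given below. The clustering algorithm is not specified in this section, so k-means is an assumption made purely for illustration, as are the function names, the min-max normalization, and the deterministic initialization.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def min_max(values: Sequence[float]) -> List[float]:
    """Rescale one feature to the range [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


def sq_dist(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_means(points: List[Point], n_iter: int = 50) -> Tuple[List[int], List[Point]]:
    """Plain k-means with k = 2; returns (labels, cluster centres)."""
    # Assumed initialization: the points with the smallest and largest coordinate sum.
    centres = [min(points, key=sum), max(points, key=sum)]
    labels: List[int] = []
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centre.
        labels = [min((0, 1), key=lambda k: sq_dist(p, centres[k])) for p in points]
        # Update step: each centre moves to the mean of its members.
        new_centres = []
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centres.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centres.append(centres[k])  # keep an empty cluster's centre
        if new_centres == centres:
            break
        centres = new_centres
    return labels, centres
```

A bi-modal feature space is obtained by zipping one normalized ionic feature with one normalized electronic feature, e.g. `list(zip(min_max(height_i), min_max(height_el)))`, where `height_i` and `height_el` are hypothetical lists of per-event values.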

Classification: current blockade mean
In order to further assess the blockade features and the hierarchy of the information hidden in the data, we assess the class balance related to the three classification schemes (CNN, XGB, and RFC) applied to the data from all three experiments. The total number of data points, together with the data points represented in each class, is depicted in figure 3(a). The large class imbalance of the datasets (filled bars) in this figure was balanced by applying both over- and under-sampling techniques, which remove data points from over-represented classes or create new artificial data points for under-represented classes in the training set. This procedure led to equally distributed classes with 300 data points in each experiment. With the balanced datasets, we proceed towards identifying a hierarchy in the features in order to assess their influence on learning and predicting. We calculated the mean SHAP values of the ionic (i), electronic (el), and bi-modal (i-el) channels for each of the three classification schemes separately. The SHAP trends of the dataset ssDNA for two classifiers and the four features, calculated separately for the two channels, are ranked in figure 3(b). Higher SHAP values indicate a stronger impact of the respective feature on the model output/prediction, and such features should be used in training. The results from the classification schemes CNN and XGB are shown. The observed feature hierarchy is on average similar for the three classification schemes. The feature with the larger contribution, i.e. larger SHAP values, is the blockade mean in both channels, followed by the blockade height. According to this figure, for both channels and the different classification schemes, the mean and height are more important than the dwell time and level information in the current blockades. Evidently, the electronic mean demonstrates remarkable consistency across all methods, while there is overall a notable consistency in the ranking of feature importance for classification purposes.
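The balancing step described above can be sketched as random under- and over-sampling towards a common class size of 300. This is a minimal illustration with a hypothetical function name; over-sampling is implemented here as random duplication of existing events, which is an assumption, since the exact scheme used to create artificial data points is not detailed in this section.

```python
import random
from typing import Dict, List, Sequence


def balance_classes(
    samples_by_class: Dict[str, Sequence],
    target: int = 300,   # data points per class after balancing
    seed: int = 0,       # fixed seed so that the sketch is reproducible
) -> Dict[str, List]:
    """Return a copy of the data in which every class has exactly `target` samples."""
    rng = random.Random(seed)
    balanced = {}
    for label, samples in samples_by_class.items():
        samples = list(samples)
        if not samples:
            raise ValueError(f"class {label!r} has no samples to draw from")
        if len(samples) >= target:
            # Under-sampling: draw `target` distinct samples without replacement.
            balanced[label] = rng.sample(samples, target)
        else:
            # Over-sampling (assumed): pad with randomly duplicated samples.
            extra = rng.choices(samples, k=target - len(samples))
            balanced[label] = samples + extra
    return balanced
```

With the event counts reported above (700, 100, and 417), the ssDNA and polyLys classes would be under-sampled and the dsDNA class over-sampled. To avoid duplicated events leaking into a test set, such balancing is normally applied to the training split only.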
In order to strengthen these findings, the global SHAP values are provided in figure 3(c) for the RFC. Inspection of this panel strongly implies that the ionic-electronic features mean and height inherently include significant predictive power, regardless of the specific classification approach used, within the range of the expected conceptual algorithmic differences and for all three experiments (additional data are provided in the supplementary information). For a final consistency check, we provide in figure 3(d) the ΔSHAP as a metric to quantify the extent of deviation in the calculated SHAP values across the various models. ΔSHAP is defined as the difference between the maximum (SHAP_max) and minimum (SHAP_min) SHAP values obtained from the models, in relation to the overall maximum SHAP value. Accordingly, a lower ΔSHAP implies greater robustness in the SHAP values. The figure shows that dwell and lvl lead to a ΔSHAP close to 100%. This can be linked to the fact that these non-important features often have very low SHAP values (close to zero), shifting the deviation to the maximum (100%). Note that the SHAP values for the electronic features are better defined and more consistent than those in the ionic channel, leading to a smaller ΔSHAP for the electronic mean and height than for the ionic mean and height. However, both features in both channels are highly important. Overall, the SHAP analysis clearly points to the fact that, of all features, the mean is globally the one that scores best, followed by the height. At the same time, the dwell time and the number of levels in the blockade are not that important. These findings can all be traced back to the fact that the mean and height features include the most important information of the blockades and the experiment overall, while the dwell time and the number of levels do not contain essential information at the single-unit level (nucleobases, amino acids) with respect to the experimental details and molecular identity/sequence. The results clearly reveal that using both channels favors the bio/molecule detection, with the electronic mean feature being the one with the highest impact towards prediction.
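As a small worked illustration of the ΔSHAP metric, the sketch below computes the spread of one feature's mean SHAP values across the models as a percentage. The formula used, ΔSHAP = 100 × (SHAP_max − SHAP_min) / SHAP_max with the maximum taken over the models for that feature, is one reading of the verbal definition and should be treated as an assumption; the function name and the numbers in the usage example are hypothetical.

```python
from typing import Sequence


def delta_shap(values: Sequence[float]) -> float:
    """Relative spread (in %) of one feature's mean |SHAP| values across models.

    Assumed formula: 100 * (max - min) / max, where max and min run over
    the models (e.g. CNN, XGB, RFC) for a single feature.
    """
    if not values:
        raise ValueError("need at least one SHAP value")
    top = max(values)
    if top == 0:
        return 0.0  # all models agree that the feature has no influence
    return 100.0 * (top - min(values)) / top


# Hypothetical values for illustration only (not taken from the paper):
consistent_feature = delta_shap([0.50, 0.40, 0.45])   # small spread across models
negligible_feature = delta_shap([0.02, 0.00, 0.01])   # one model gives ~0
```

Under this reading, a feature that one model ranks as essentially irrelevant (a SHAP value near zero) is pushed towards ΔSHAP = 100% even if its absolute values are tiny, which is consistent with the behaviour described for dwell and lvl.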

Read-out protocol proposal: mean versus height
The analysis provided in this work points to certain important aspects that should be considered in read-out workflows of nanopore (bio)sensing experiments. Such a well-designed protocol considers the experimental setup, signal acquisition, data processing, event detection, and data analysis in order to accurately characterize, understand, and predict the type/length/sequence of biomolecules. A possible error-free nanopore detection of unknown analytes would rely on a careful and interpretable choice of the algorithmic steps and features. Within this concept, we do not aim to make predictions, but rather take one step back in order to provide the ground for making accurate predictions. Accordingly, the analysis could identify the very fine molecular signatures, in the form of features, giving rise to a current blockade. In essence, the importance of both height and mean is strongly interpretable and intuitive. Supported also by our previous work [31], the feature height includes the multi-level characteristics of a translocating molecule, and thus the two extreme configurational changes of the molecule through the pore. The feature mean provides the average information on the bulkiness or volume exclusion of the translocating molecule, inherently including the information on its distinct units, nucleotides or amino acids, which differ in volume, physical properties, and their interaction with their environment. Accordingly, identifying only two interpretable features not only reduces the learning dimensionality, but also allows the incorporation of domain knowledge into a read-out protocol for multi-modal measurements. This, in turn, enables a potential mapping of significant physical descriptors from the events of multimodal signals, providing insights into the connection between the channels. Based on these points, we propose an efficient, physically intuitive, and interpretable nanopore read-out relying on the mean and height descriptors in multimodal nanopore data. At the same time, we identify the importance of two distinct steps: clustering of correlated multimodal data, pointing to the most efficient feature (the height), and classification, indicating a second efficient feature (the mean). These two features, blockade mean and height, are expected to be the only necessary information in the training data for an algorithm to learn to make predictions with very high accuracy. In the end, the predictions made in the read-out protocols cited in the Introduction section and others can be enhanced by considering the outcome of our analysis.

Conclusions
Unsupervised and supervised learning techniques have been applied in this work in order to analyze and learn from concurrent ionic and electronic current measurements from 2D MoS2 nanopores electrophoretically threading biomolecules. The events detected in both channels were correlated and allowed the extraction and comparison of the blockade features: dwell time, number of levels in a single blockade, blockade mean, and blockade height. The analysis of the two-dimensional feature representations revealing clusters, and the classification closely linked with the SHAP analysis for the feature hierarchy, provided two important and complementary pieces of information: the blockade height strongly maps the correlations of the translocation events, while the blockade mean strongly maps the molecule identity, be it its length, type and/or sequence. The findings of this work can be summarized through the following points: (i) only two blockade features, the height and the mean, are sufficient for identifying translocating molecules, which (ii) is consistent in both the electronic and ionic blockades, so that (iii) using these two features in both channels enhances the read-out efficiency/prediction. As a result, our work has a three-fold impact: it (a) points to a nanopore read-out relying on two physically intuitive and interpretable features and sequential clustering and classification steps, (b) proposes a significantly reduced feature dimensionality, thus reducing the computational cost, and (c) emphasizes the importance of and need for multi-modal nanopore signals for efficient detection. Note that while the ionic data include more of the bulk properties of the biomolecules (length, conformation), the electronic data inherently involve finer differences between the biomolecular units (nucleobases, amino acids). Incorporating the latter channel in the learning scheme is thus essential for enhancing these differences in view of error-free sequencing. To this aim, it is necessary to tune the conditions and measurement tools (bandwidth, etc.) and reduce the noise in order to resolve translocation events and blockades that can deliver clearly distinct mean and height features. We anticipate that this would be an extremely important step towards error-free, ultra-long read-out and sequencing schemes that are not black boxes.

Figure 1. Part of the original time series of (a) the ionic and (b) the electronic concatenated events taken from the experimental nanopore measurement with ssDNA [13]. Representative current blockades for the (c) ionic (left) and (d) electronic current measurements for the ssDNA experiment. On the top, the raw signal together with the features mean and current drop are depicted (yellow and green lines). The lower panels represent the same but processed signals. These are used for extracting the feature height denoted by the purple vertical line (see text for details). In (e), a schematic representation of the experimental setup is given, showing the two different current measurements, ionic (I_ion) and electronic (I_el), and omitting details apart from the most important parts involved, namely the solution, the analyte, and the nanopore. In (f), the algorithmic workflow built in this work is sketched, as discussed in the text.

Figure 2. Normalized two-dimensional clustering graphs for the ssDNA dataset. The clusters of the blockade mean and blockade height features in the (a) ionic and (b) electronic measurement channels are depicted, as denoted by the legends. The lower panels evaluate both channels together, in (c) for the blockade mean and (d) for the blockade height. The black filled circles denote the center of each cluster.

Figure 3. (a) The class balance for all datasets, before (filled bars) and after (empty bars) the application of the random over- and under-sampling schemes. (b) The mean absolute SHAP values for the (top) ionic ('i') and (lower) electronic ('el') current traces for the ssDNA experiment using the CNN. In the insets, the trends obtained with XGB are provided for comparison. (c) The global mean SHAP values for the RFC classification, derived from the mean absolute SHAP values of the three datasets. For the color coding refer to the legend. (d) The deviation ΔSHAP (in %) of the calculated mean SHAP values within all three schemes for the ssDNA experiment.

Table 2. Filter parameters (see text) applied for the event detection in both ionic and electronic channels. In the second column, the charge state of each biomolecule is given, as known in normal conditions. Note that the effective charge state might differ in the experiments.