The information of attribute uncertainties: what convolutional neural networks can learn about errors in input data

Errors in measurements are key to weighting the value of data, but are often neglected in Machine Learning (ML). We show how Convolutional Neural Networks (CNNs) are able to learn about the context and patterns of signal and noise, leading to improvements in the performance of classification methods. We construct a model whereby two classes of objects follow an underlying Gaussian distribution, and where the features (the input data) have varying, but known, levels of noise. This model mimics the nature of scientific data sets, where the noise arises as the realization of random processes whose underlying distributions are known. The classification of these objects can then be performed using standard statistical techniques (e.g., least-squares minimization or Markov-Chain Monte Carlo), as well as ML techniques. This allows us to take advantage of a maximum likelihood approach to object classification, and to measure the amount by which the ML methods are incorporating the information in the input data uncertainties. We show that, when each data point is subject to a different level of noise (i.e., noise with a different distribution function), that information can be learned by the CNNs, raising the ML performance to at least the same level as the least-squares method -- and sometimes even surpassing it. Furthermore, we show that, with varying noise levels, the confidence of the ML classifiers serves as a proxy for the underlying cumulative distribution function, but only if the information about the specific input data uncertainties is provided to the CNNs.


Introduction
Machine Learning (ML) methods are becoming increasingly popular in the analysis of scientific data sets, especially in areas with large volumes of data such as high energy physics and astrophysics -- see, e.g., Storrie-Lombardi et al. (1992); Firth et al. (2003); Baldi et al. (2014); Mehta et al. (2019). Typically, scientific data consist of individual measurements, each one with its own level of noise, and in this work we focus on the heteroscedastic (varying, but known, scatters) regime. We also use as a reference the maximum likelihood classification, which allows us not only to derive analytical formulas for the output errors, but also to find the exact uncertainty in the classification by running a Markov Chain Monte Carlo for each object in a test sample. In particular, in this paper we show that CNNs are able to learn about the context of the information in the specific noise levels of the data, in such a way that they approach (and sometimes even surpass) the performance of the maximum likelihood approach.

The smiley-frowny model
Our toy model consists of two simple classes of objects: parabolic curves with positive and negative concavities. Hence, we have a binary classification problem where the positive and negative classes are convex ("smiley") and concave ("frowny"), respectively. The basic idea is that each object is represented by a set of n data points (or features), in such a way that each data point has an uncertainty that derives from some known probability density function (PDF). These uncertainties may be called the "error bars" of the measurements, which for our purposes can be thought of as the variances (second central moments) of the PDFs. We discuss these uncertainties in detail in the next Section.
Since the underlying model (parabolic curves) is precisely known, we are able to classify each object using the method of maximum likelihood -e.g., if we only require the class of the object, a least squares optimization method can be employed.Moreover, we are able to determine exactly the confidence of the classification of smiley and frowny objects using either a Fisher matrix approach or a Markov Chain Monte Carlo (MCMC) exploration of parameter space.The overall performance of the classification, as well as the confidence of the output for each object, can then be compared with ML models which do and do not include the input data uncertainties.
It is important to stress that we are not trying to fit a curve. We are interested in employing a set of measurements in order to classify the objects. Our task is, therefore, to label curves as a smiley or a frowny in a binary classification scheme:

• Positive class (1): smiley (convex) curves, with curvature +a;
• Negative class (0): frowny (concave) curves, with curvature −a.

The parabolic curves are generated according to the model:

x̄_i = ± a (i/n)² + b (i/n) + c ,    (1-2)

where the indices i = 1, 2, ..., n, with n being the number of measurements, which are the attributes, or features, that characterize the curves, and the plus (minus) sign corresponds to the positive (negative) class. The parameters a, b and c are drawn from normal distributions (see Fig. 1) with means and standard deviations specified in Table 1. The random nature of the parameters ensures that we have a variety of objects in each class, and the two PDFs for the curvature parameter a are sufficiently separated (by 4σ) that the probability that an object sampled from the distribution of one class has the sign of the other class is 3.17 × 10⁻⁵, which is irrelevant for the purposes of our discussion.
With the definitions of Eqs. (1-2), both curves grow by the same amount from start to end: x(i = n) − x(i = 1) = b (n − 1)/n ± a (n² − 1)/n², where the plus and minus signs refer to the positive (convex, smiley) and negative (concave, frowny) classes, respectively. Since the distributions of the curvatures are anti-symmetric, a ↔ −a, all curves on average rise by the same amount from i = 1 to i = n, which further mixes the two classes. This feature of the model ensures that the only significant distinction between smiley and frowny objects is the concavity of the curves. Otherwise, a ML classifier could employ other patterns (such as whether the function grows or falls with i) to distinguish between the two classes and, thus, outperform a model-based maximum likelihood method such as least squares.
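This generative model is straightforward to simulate. Below is a minimal numpy sketch (not the authors' code): µ_a = 1 and σ_a = 0.25 follow from the 4σ separation quoted above, while the means and widths used for b and c are illustrative placeholders, since Table 1 is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_curves(m, n, mu=(1.0, 0.0, 10.0), sigma=(0.25, 1.0, 5.0)):
    """Sample m noiseless smiley/frowny curves with n features each.
    mu/sigma hold the (a, b, c) distribution parameters; the b and c
    values are placeholders, not the values of Table 1."""
    y = rng.integers(0, 2, size=m)             # 1 = smiley, 0 = frowny
    a = rng.normal(mu[0], sigma[0], size=m)    # curvature magnitude (4-sigma from 0)
    b = rng.normal(mu[1], sigma[1], size=m)
    c = rng.normal(mu[2], sigma[2], size=m)
    sign = np.where(y == 1, 1.0, -1.0)         # +a for smiley, -a for frowny
    i = np.arange(1, n + 1) / n                # i/n for i = 1..n
    x = sign[:, None] * a[:, None] * i**2 + b[:, None] * i + c[:, None]
    return x, y

x, y = make_curves(1000, 20)
```

The per-point heteroscedastic noise described in the next Section would then be added on top of these noiseless curves.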
Table 1: Parameters of the normal distributions from which the coefficients of the parabolic curves are sampled.

Noise and information in input data: a toy model
Scientific data sets are comprised of measurements which are performed with the help of instruments with some nominal uncertainties. However, those uncertainties are usually not fixed once and for all: even with the same instrument, some measurements may have higher or lower uncertainties depending on several conditions. Any experiment will carefully assess what those uncertainties are for each data point, taking into account the different circumstances under which those measurements were made -- see, e.g., Taylor (1997).
To be clearer, one can think of two main sources of aleatoric uncertainties. The first is the quality, or nominal sensitivity, of the apparatus used to perform the measurements: e.g., the lengths of nails that are measured with a micrometer are intrinsically more accurate than the same measurements made using a ruler. The second source arises from the different conditions under which the measurements are made by the same apparatus. As an example, one can think of the measurements of the lengths of those same nails with a micrometer, but on some days the temperature of the laboratory is more stable than on others, resulting in different dilation factors for the nails. Another example comes from astronomical data sets: in that case, the nominal uncertainty is determined by the size of the telescope's mirror and the sensitivity of the detectors, among other factors. However, some nights are brighter than others, some objects appear close to bright sources of light, and so on, meaning that different images, as well as different parts of the same image, have varying degrees of data uncertainty. Any careful experimenter will label which measurements have higher or lower levels of uncertainty as a result of those different conditions.
In this work we construct a simple model to reproduce these varying degrees of uncertainty in scientific data. First, we assume that the nominal accuracy of the measuring instrument is given in terms of a parameter σ_0, meaning that under some "ideal" conditions for that instrument, the measurements are random numbers that follow a normal distribution of width σ_0. Second, we introduce parameters g_i that follow a uniform distribution, in such a way that the actual measurements x_i (now under varying conditions) have uncertainties given by g_i σ_0. In other words, we have:

x_i = x̄_i + δx_i ,    (3)

where x̄_i are the true values of the measurements and δx_i are random numbers sampled from Gaussian probability distribution functions with zero mean and standard deviation g_i σ_0, i.e.:

δx_i ∼ N(0, g_i σ_0) .    (4)

Here the g_i are numbers that are known for each individual measurement: one can think of a label for each data point indicating the degree to which the uncertainties are higher or lower than the nominal ones. In order to simulate the varying conditions under which those measurements are performed, we draw the factors g_i from a uniform distribution in the interval ḡ − ∆g ≤ g_i ≤ ḡ + ∆g, where ∆g is the noise dispersion parameter. The expectation value (mean) of g_i given the uniform distribution is ⟨g_i⟩_U = ∫ dg g U(g) = ḡ, where U(g) = 1/(2∆g) for ḡ − ∆g ≤ g ≤ ḡ + ∆g, and U(g) = 0 otherwise (clearly, 0 ≤ ∆g < ḡ).
For ∆g = 0 we have a homoscedastic dataset, and as this parameter grows, the degree of heteroscedasticity increases. However, we stress that the factors g_i are known and, as opposed to the Gaussian random process underlying the noise, the values g_i should not be regarded as stochastic variables in a fundamental sense: they are part of the information of the data set, and can be passed on to the ML methods.
Since the two distributions are uncorrelated by construction, the mean variance of the data errors can be easily computed:

⟨ (g_i σ_0)² ⟩_U = σ̄_0² [ 1 + ∆g²/(3 ḡ²) ] ,    (5)

where σ̄_0 = ḡ σ_0 denotes the mean nominal noise. This simple result tells us that when there are varying levels of noise in the data, as described by this model, the mean noise of the ensemble is actually higher than the mean nominal noise σ̄_0.
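The average over the uniform distribution of g can be checked by direct Monte Carlo. A minimal numpy sketch with illustrative parameter values (not those of Table 2):

```python
import numpy as np

rng = np.random.default_rng(1)
gbar, dg, sigma0 = 1.0, 0.5, 0.25   # illustrative values

# Sample noise-level factors g from U(gbar - dg, gbar + dg)
g = rng.uniform(gbar - dg, gbar + dg, size=1_000_000)

# Monte Carlo estimate of the mean data variance <(g*sigma0)^2>
mc = np.mean((g * sigma0) ** 2)

# Closed form: sigma0bar^2 * (1 + dg^2/(3 gbar^2)), with sigma0bar = gbar*sigma0
analytic = (gbar * sigma0) ** 2 * (1 + dg**2 / (3 * gbar**2))
```

With 10⁶ samples the Monte Carlo estimate agrees with the closed form to a fraction of a percent.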
Traditional statistical tools for data analysis are naturally equipped to deal with different levels of noise in input data. In particular, the likelihood is given by:

L = N exp[ − (1/2) Σ_i (x_i − x̄_i)² / (g_i² σ_0²) ] ,    (6)

where N is some normalization and x̄_i is the expectation value of the variable x_i (or, in this context, the "theory" that we would like to fit to the data). In scientific applications we usually assume the theory to depend on a set of parameters denoted by the vector θ_µ through some model, x̄_i(θ_µ) -- in our example, those parameters are θ_µ = {a, b, c}, so µ = 1, 2, 3. The likelihood function tells us which regions in parameter space are preferred, given the model, the data, and the uncertainties (or, more generically, the data covariance). Although a more thorough exploration of the likelihood function in parameter space is usually carried out using Markov Chains generated via a Monte Carlo algorithm (in that respect, see Section 5), we can estimate the shape of the likelihood using a Gaussian approximation, in which case the logarithm of the likelihood is a quadratic function. The curvature of that multivariate quadratic function at the peak (maximum likelihood) is the Fisher information matrix, which is computed by means of the Hessian:

F[θ_µ, θ_ν] = − ⟨ ∂² ln L / ∂θ_µ ∂θ_ν ⟩ .    (7)

The inverse of the Fisher matrix yields an estimate of the parameter covariance, Cov[θ_µ, θ_ν] → {F[θ_µ, θ_ν]}⁻¹ -- see, e.g., Tanabashi et al. (2018) for many examples and applications in the physical sciences.
For the likelihood function of Eq. (6) we obtain:

F[θ_µ, θ_ν] = Σ_i (∂_µ x̄_i) (∂_ν x̄_i) / (g_i² σ_0²) ,    (8)

where ∂_µ ≡ ∂/∂θ_µ, and we used the fact that the data is unbiased, ⟨x_i⟩ = x̄_i, and that the "measurements" do not depend on the parameters, ∂_µ x_i = 0 (the theory x̄_i(θ_µ), on the other hand, obviously does). At this point we can take the expectation value over the uniform distribution of the noise dispersion, using ⟨1/g²⟩_U = 1/(ḡ² − ∆g²), to obtain the mean Fisher matrix:

⟨ F[θ_µ, θ_ν] ⟩_U = [1/(1 − ∆g²/ḡ²)] Σ_i (∂_µ x̄_i) (∂_ν x̄_i) / σ̄_0² .    (9)

We recognize the sum on the right-hand side as the Fisher matrix for a nominal uncertainty σ̄_0. Therefore, the amount by which different data points may have different uncertainties (expressed by the noise dispersion parameter ∆g) has the effect of increasing the Fisher matrix with respect to the case where the noise is fixed to the nominal value.
In other words: when each data point has a different (but known) uncertainty, even though the mean noise level is higher, the Fisher information actually increases. This happens, of course, because the specific noise levels in the data act as weights: noisier data are down-weighted, and less noisy data are up-weighted, resulting in higher discriminatory power. Here we are simply restating the fact that we always lose information if we do not distinguish between low-noise and high-noise measurements. This statement remains true when applied to ML techniques.
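The gain from inverse-variance weighting boils down to the fact that, for the uniform distribution of g, ⟨1/g²⟩ = 1/(ḡ² − ∆g²) > 1/ḡ². A quick Monte Carlo confirms the integral (illustrative parameter values):

```python
import numpy as np

rng = np.random.default_rng(2)
gbar, dg = 1.0, 0.5

g = rng.uniform(gbar - dg, gbar + dg, size=1_000_000)

# Inverse-variance weighting brings in <1/g^2>, which for the uniform
# distribution integrates to 1/(gbar^2 - dg^2) -- larger than 1/gbar^2,
# so knowing the per-point noise levels increases the Fisher information.
mc = np.mean(1.0 / g**2)
analytic = 1.0 / (gbar**2 - dg**2)
```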
Our toy model for the smiley/frowny objects has a very simple, analytical Fisher matrix. Using Eq. (1-2) in the Fisher matrix of Eq. (9) we obtain:

⟨F⟩ = [1/(σ̄_0² (1 − ∆g²/ḡ²))] Σ_{i=1}^n
  [  (i/n)⁴    ±(i/n)³   ±(i/n)²
     ±(i/n)³   (i/n)²    (i/n)
     ±(i/n)²   (i/n)     1      ] ,    (10)

where the plus and minus signs correspond to the smiley and frowny objects, respectively. All terms in this matrix have a closed form, given by the sum rules:

Σ_{i=1}^n 1 = n ,   Σ_{i=1}^n i = n(n+1)/2 ,   Σ_{i=1}^n i² = n(n+1)(2n+1)/6 ,
Σ_{i=1}^n i³ = n²(n+1)²/4 ,   Σ_{i=1}^n i⁴ = n(n+1)(2n+1)(3n²+3n−1)/30 .    (11-15)

From these expressions we can compute the mean uncertainty in the parameter a, which is exactly the same for the two classes, since they are symmetric in the sense that a ↔ −a. Upon inverting the Fisher matrix we obtain the covariance matrix, whose diagonal term corresponding to the parameter a yields the result:

Σ_a² = [180 n³/(n⁴ − 5n² + 4)] σ̄_0² (1 − ∆g²/ḡ²) ,    (16)

where we defined the posterior uncertainty of the parameter a as Σ_a, which should not be confused with the width of the parent distribution for that parameter, σ_a, which expresses the intrinsic (true) diversity of objects in our classes. As discussed above, the uncertainty in the class of the objects is lower when the input data has varying levels of noise. Furthermore, with only two points (n = 2) the class is completely undetermined -- as it should be, since with two points it is impossible to derive the curvature (indeed, n⁴ − 5n² + 4 vanishes for n = 2). For larger values of n the uncertainty scales as Σ_a ∼ 1/√n.

We can further define a mean confidence for a maximum likelihood classification. The variance obtained above can be used in a normal distribution for the parameter a, and integrated for positive and/or negative values, yielding the probability that an object is either in the positive or negative class. E.g., the probability that an object has a > 0 is given by:

P(a > 0) = (1/2) [ 1 + Erf( µ_a / √(2 (σ_a² + Σ_a²)) ) ] ,    (17)

where Erf(z) = (2/√π) ∫₀^z dt e^{−t²} is the error function, and µ_a and σ_a are, respectively, the central value and width of the distribution for the parameter a -- see Table 1. The expression above is therefore the average confidence of the maximum likelihood classification -- and, of course, it also expresses the cumulative distribution function at the value µ_a.
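The closed-form factor 180 n³/(n⁴ − 5n² + 4) can be verified against a direct numerical inversion of the 3×3 Fisher matrix for (a, b, c). A minimal numpy sketch (function names are ours):

```python
import numpy as np

def sigma_a_closed(n, sigma0bar, gbar, dg):
    """Closed-form posterior uncertainty of the curvature parameter a."""
    return np.sqrt(180 * n**3 / (n**4 - 5 * n**2 + 4)
                   * sigma0bar**2 * (1 - dg**2 / gbar**2))

def sigma_a_fisher(n, sigma0bar, gbar, dg):
    """Build and invert the mean Fisher matrix for (a, b, c) numerically."""
    u = np.arange(1, n + 1) / n
    # Rows of D are the model derivatives (d/da, d/db, d/dc) = ((i/n)^2, i/n, 1)
    D = np.stack([u**2, u, np.ones(n)], axis=1)
    F = D.T @ D / (sigma0bar**2 * (1 - dg**2 / gbar**2))
    return np.sqrt(np.linalg.inv(F)[0, 0])
```

The two expressions agree for any n > 2 (at n = 2 the Fisher matrix is singular, reflecting the undetermined curvature).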
In Fig. 2 we plot the probability of Eq. (17) for µ_a = 1 (smiley class), as a function of ∆g. In this example we used ḡ = 1, and each object has n = 20 features (data points). From top to bottom, the curves correspond to increasing values of the nominal input error, σ̄_0 = σ_0 = 0.25, 0.5, and 1.0. For very high uncertainties in the input data (σ_0 ≳ 1.0) the probability approaches 0.5, which means zero confidence in the classification. That confidence grows as we lower the nominal uncertainty and/or increase the noise dispersion parameter ∆g.
However, as much as the calculations above are able to provide insights into the problem at hand, we would be misguided if we attempted to use Eq. (17) to infer the confidence of a maximum likelihood classification of individual objects, for two reasons. First, the Cramér-Rao-Fréchet bound [see, e.g., Efron (1986)] implies that the Fisher estimator has minimal variance, meaning that the probability expressed by Eq. (17) is an extreme, limiting case. The second reason is that we approximated the actual Fisher matrix, Eq. (8), by an average over the (uniformly distributed) noise dispersion parameter, which resulted in Eq. (9). Due to the random nature of the specific uncertainties g_i in our model, individual objects may have better (less noisy) or worse (noisier) data points, and for those objects the confidence will differ from what is expressed by Eq. (17). Hence, Eq. (17) represents an ideal scenario: in practice, applying the least squares method to a sample of objects results in an average accuracy which is slightly worse than the one that results from this analytical formula. Therefore, the exploration of the likelihood in parameter space must be performed object-by-object according to Eq. (6), using either a least squares method (if one is only interested in the class itself) or an MCMC (if we also need to know the probability of the classification). These maximum likelihood results then form the basis for our comparison with the performance and confidence of the classification using ML methods.
It is instructive, in the context of this Section, to also consider an associated linear problem. A linear estimator for the curvature is given by:

â_µ = Σ_i M_µi x_i ,    (18)

where M_µi = (n²/2) (δ_{i,µ+1} − 2 δ_{i,µ} + δ_{i,µ−1}), and the indices µ = 2, 3, ..., n−1 can be regarded as the n−2 intermediate points where we are able to estimate the curvature (a) through differences of the neighboring points. It is easy to check that the linear estimator is unbiased, and independent of the model parameters b and c: just use the identities Σ_i M_µi = Σ_i M_µi i = 0, and Σ_i M_µi (i/n)² = 1.
The covariance of the linear estimator is given by the expectation value:

C_µν = ⟨ δâ_µ δâ_ν ⟩ = Σ_i M_µi M_νi g_i² σ_0² ,    (19)

which, after averaging over the uniform distribution for the noise dispersion, leads to:

⟨ C_µν ⟩_U = σ̄_0² [1 + ∆g²/(3ḡ²)] Σ_i M_µi M_νi .    (20)

It is straightforward (though by no means trivial) to check that the mean Fisher information corresponding to this linear estimator yields the output uncertainty:

Σ_a²(lin) = [180 n³/(n⁴ − 5n² + 4)] σ̄_0² [1 + ∆g²/(3ḡ²)] ,    (21)

which can be compared with Eq. (16). Equation (21) highlights a property that was already apparent in Eq. (16): the uncertainty in the output (the curvature, a) depends on the number of features only through the factor 180 n³/(n⁴ − 5n² + 4). The dependence of the output uncertainty on the pattern of input errors, on the other hand, is encapsulated in the ∆g terms. When we include the information about the specific noise levels through inverse covariance weighting, as expressed by Eq. (8), that pre-factor is 1 − ∆g²/ḡ², i.e., providing the information about the noise improves the constraints. However, when we neglect the error information and resort to direct estimators such as â_µ, that pre-factor becomes 1 + ∆g²/(3ḡ²), increasing the output uncertainties.
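A second-difference estimator of this kind is easy to exercise on a noiseless curve: for x_i = a (i/n)² + b (i/n) + c, the combination (n²/2)(x_{µ+1} − 2x_µ + x_{µ−1}) returns a exactly at every intermediate point, independently of b and c. A short numpy sketch (the normalization is our concrete choice for the matrix M):

```python
import numpy as np

n = 20
i = np.arange(1, n + 1)
a, b, c = 1.0, 0.3, 10.0

# Noiseless smiley curve from the model of Eqs. (1-2)
x = a * (i / n) ** 2 + b * (i / n) + c

# Second-difference estimator at the n-2 intermediate points:
# the linear (b) and constant (c) pieces cancel exactly, leaving a.
a_hat = (n**2 / 2) * (x[2:] - 2 * x[1:-1] + x[:-2])
```

With noisy inputs, each â_µ would scatter around a with the covariance of Eq. (19); the point of the comparison above is that no unweighted combination of these estimators can match the inverse-variance-weighted result.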
Finally, it is worth pointing out the limitations of tools such as Tikhonov regularization, which are often used to prevent overfitting and to minimize empirical error -- see, e.g., Bousquet et al. (2013), as well as related methods such as the one proposed by Czarnecki and Podolak (2013). Basically, regularization techniques work by effectively imposing a threshold on very small eigenvalues in ill-posed inverse linear problems. However, the associated linear problem presented above is perfectly well-posed, and it still results in a degradation of the output uncertainties when compared with the optimal (inverse covariance weighting) estimator. Furthermore, the covariance of the linear estimator, Eq. (19), is a positive-definite matrix, Σ_µν C_µν V_µ V_ν ≥ 0 for any real-valued vector V_µ, so all the eigenvalues of this covariance matrix are real, non-negative numbers. Consequently, imposing any kind of minimum threshold on those eigenvalues would in fact increase the linear estimator uncertainty, Eq. (21), which means that no amount of regularization can possibly compensate for the lack of information about the noise of each input data point.

Machine Learning Classifiers
Our problem is to classify curves, each characterized by a set of features, into one of two classes. We can model this as a supervised machine learning classification task. Our training sample is the set of tuples {x^(j), y^(j)}, j = 1, ..., m, where x^(j) ∈ R^n are the smiley-frowny parabolic curves and y^(j) ∈ {0, 1} are the corresponding labels. Additionally, we have the uncertainties σ^(j) ∈ R^n for each x^(j), for those models that use that information.
Due to the nature of the data, where the features are sorted in a significant way, we find it more appropriate to use CNNs, since they are able to recognize local patterns. Multiple problems of sequential data analysis are in fact tackled with CNNs and 1D convolutional kernels -- see, e.g., Acquarelli et al. (2016); Busca and Balland (2018); Cabayol et al. (2018); Ismail Fawaz et al. (2019); Mozaffari and Tay (2020); Kawamura et al. (2021). In this work we implemented the classifiers with keras (Chollet et al., 2015).
To classify the smiley-frowny curves, we start with a baseline dataset with the parameters described in Table 2; in Section 5 we explore multiple scenarios where we deviate from this baseline. We created multiple CNNs which differ from each other mainly in the content and shape of the input data. The specifications of each version are described in the next subsections. The general training settings are the following. We used the binary cross-entropy loss function and the Adam (Kingma and Ba, 2017) optimizer. In all intermediate layers we used the ReLU activation function, and in the last layer we used the Softmax activation function, so that the scores of both classes sum up to one and, thus, we have a probabilistic interpretation of the output. The convergence of the training was monitored with learning curves for the accuracy and for the loss function at each iteration (epoch), for both the training and validation sets, and we used batches of size 100 to train the networks. We used the EarlyStopping callback conditioned on the validation set loss score with a patience of 16 epochs, and the ReduceLROnPlateau callback to reduce the learning rate when the validation set loss stagnates for 10 epochs. The final set of weights is the one corresponding to the epoch with the best accuracy in the validation set.
We now turn to the description of the different networks that we trained in order to classify the smiley and frowny objects.

CNN1D
In the CNN1D models, the input data are 1D vectors and, thus, the convolution kernels are also 1D. We compared three versions of CNN1D models (see Fig. 3): • no-σ: the input data shape is (n, 1); it contains only the n measurements.
• with-σ: the input data shape is (2n, 1), where the n measurements are followed by the n corresponding uncertainties. This is a first approach to including the uncertainties, without making any hypothesis about the best way to represent this additional information.
• stack-σ: the input data is the set of measurements and errors arranged in channels. The input shape, therefore, is (n, 2). It is identical to CNN1D with-σ in terms of the available information, but, in this case, the spatial relation between the measurements and the corresponding errors is represented in a straightforward way, providing a context for the uncertainties.
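The difference between the with-σ and stack-σ input layouts amounts to a reshape of the same information. A minimal numpy sketch (the array contents are dummy values):

```python
import numpy as np

n = 20
x = np.random.default_rng(3).normal(size=n)   # measurements
s = np.full(n, 0.25)                          # per-point error bars g_i * sigma_0

# with-sigma: measurements followed by errors in one long vector, shape (2n, 1)
with_sigma = np.concatenate([x, s])[:, None]

# stack-sigma: measurements and errors as two aligned channels, shape (n, 2),
# so a 1D kernel sees each x_i right next to its own error bar
stack_sigma = np.stack([x, s], axis=-1)
```

In the stack-σ layout the point-wise signal-noise association is built into the input, which is what allows the convolution kernels to use the error bars as local context.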
For the baseline training set (Table 2), we defined a simple standard network for all three CNN1D versions, which consists of three convolution layers with kernel shapes (5,), (3,) and (3,) and 32, 64, and 64 filters, respectively, one intermediate dense layer with 64 neurons and, finally, the output layer with 2 neurons. Each convolution layer uses padding, i.e., the output feature map has the same size as the input feature map, and is followed by a MaxPooling layer with kernel size and stride of (2,). We also add BatchNormalization and dropout layers, with dropout rates typically between 0.2 and 0.4. As we deviate from the baseline dataset, some modifications of this standard architecture might be necessary to ensure the convergence of the models. In particular, as we increase the level of noise in the data, by either decreasing the number of features n or increasing the nominal error σ_0, a less complex network (with fewer layers) or a more regularized network is more appropriate because, as the data becomes noisier, the models are more likely to overfit (Abu-Mostafa et al., 2012).

CNN2D images
Several problems in the physical sciences have benefited from the power of ML methods that were developed for the analysis of images, in particular applications developed for high energy physics (Baldi et al., 2014) or astrophysics (Estrada et al., 2007). In the CNN2D method we build on the same idea; however, we use the additional dimension to represent the uncertainties in the input data.
The idea of the CNN2D images method is to represent the complete distribution of the data, given the specific uncertainties. We then organize the data, including the errors, in terms of a matrix whose columns correspond to the features (i), and whose rows correspond to the values of the distribution function of the input data. To be specific, we discretize the range of input values in terms of bins x_i → x^ρ_i, with ρ = 1, 2, ..., n_rows, and then define the values of the pixels of the CNN2D images as:

x^ρ_i = [1/(√(2π) g_i σ_0)] exp[ − (x^ρ − x_i)² / (2 g_i² σ_0²) ] ,    (22)

where x^ρ is the central value of bin ρ. The left panel of Fig. 3 illustrates the input data for this model.
The input shape of this model is (n_rows, n), where the number of rows n_rows is one among the other hyperparameters of the images that must be chosen. More details about the construction of the images can be found in Appendix A. The standard network for the baseline training set (Table 2) consists of three convolution kernels with shapes (5, 5), (3, 3) and (3, 3) and 32, 64, and 64 filters, respectively. The convolution layers are followed by MaxPooling layers with kernels of size and stride (2, 2), except for the first layer, where the shape and stride of the kernel were chosen such that the output shape of this layer is always (10, 10), i.e., it depends on the shape of the input image.
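A minimal numpy sketch of this image construction, assuming (as described above) that each column i holds the Gaussian density of measurement x_i with width g_i σ_0, evaluated at the bin centers (the range and bin count are illustrative, not the values of Appendix A):

```python
import numpy as np

def make_image(x, err, n_rows=20, lo=-5.0, hi=25.0):
    """Column i holds the normal density N(x_i, err_i^2) evaluated at
    n_rows bin centers spanning [lo, hi]; shape (n_rows, n)."""
    rows = np.linspace(lo, hi, n_rows)[:, None]           # (n_rows, 1) bin centers
    z = (rows - x[None, :]) / err[None, :]
    return np.exp(-0.5 * z**2) / (np.sqrt(2 * np.pi) * err[None, :])

# A toy curve with constant error bars: each column peaks at its measurement
img = make_image(np.linspace(0, 10, 20), np.full(20, 0.5))
```

Noisier points produce wider, fainter columns, so the 2D kernels see the per-point uncertainty directly in the image.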
The idea of representing a data vector with errors in terms of an image can be generalized to scientific data given in terms of pairs {x_i ± σ_x,i , y_i ± σ_y,i}. In that case, the multivariate probability distribution associated with the uncertainties σ_x,i and σ_y,i means that each data point i is "spread out" both in the horizontal (rows, x) and the vertical (columns, y) directions. A CNN where a 2D data set (including uncertainties) is represented by images was used recently to classify supernovae (Qu et al., 2021).

ML confronts maximum likelihood
We now present the results for the classification of curves into the two classes (smiley or frowny), in the presence of Gaussian noise in the input data. As discussed above, we generate the parabolic curves by sampling random parameters (a, b and c) for the two classes, and then computing the n features of the curves (x_i) according to Eqs. (1-2). The next step is to add noise to those features, according to Eqs. (3-4). It is important to stress that, just as in real experiments, the values g_i are stored, which allows us to keep track of the error bar of each data point, g_i σ_0. However, for each object neither the underlying "true" curve, x̄_i, nor the random noise, δx_i, is known a priori: we only have access to the measurement, x_i = x̄_i + δx_i, and the noise levels g_i σ_0 (the "error bars").
In order to study how the performances of the different networks change as a function of the parameters of the baseline dataset, we vary some individual parameters while the others are fixed to the values of Table 2.

Varying the signal-to-noise ratio
We start by investigating how the performance of the CNN classifiers, as well as of the maximum-likelihood (least squares) method, depends on the nominal noise σ_0. Fig. 4 shows the accuracy of the classifiers. For very high nominal noise all the classifiers eventually fail, with accuracies of 50%; conversely, for very small noise all models are able to achieve near-perfect accuracy. Therefore, as one should expect, in both limits (σ_0 → 0 and σ_0 → ∞) all methods become equivalent, since there is no usable information in the noise levels. However, for intermediate levels of noise, the classification methods that make use of the information about the specific noise levels in the input data are able to achieve better performance than the CNN1D no-σ method, which does not. Moreover, for σ_0 ≳ 0.3, the CNNs that take into account the noise levels of the input data are able to reach accuracies which approach that of the maximum-likelihood classification. This means that when we pass on the information about which features are noisier and should be down-weighted, and which features have less noise and should be up-weighted, we allow the algorithms to learn how to use those weights in their classifications.
From the perspective of training the CNNs, depending on the value of σ_0 we have added more regularization or reduced the number of layers. Since noisier data make the networks more prone to overfitting, for large values of σ_0 the standard architecture for the baseline data set may not be optimal, and sometimes the models do not converge properly.
In particular, we note that the CNN1D stack-σ method has a performance that is very similar to the CNN2D images method. This is perhaps to be expected, since the noise is Gaussian, and in the absence of skewness or kurtosis the key information about the distribution is already encoded in the standard deviation, g_i σ_0. For more general PDFs, where it is not possible to summarize the shape of the distribution in terms of a single parameter, using the CNN2D images approach might be an interesting alternative. Alternatively, one could think of generalizing the CNN1D with-σ models to multiple channels, each one containing the different moments of the underlying distribution functions.
Still looking at Fig. 4, we see that all three models that include the errors have better accuracies than CNN1D no-σ, but CNN1D with-σ is slightly worse than the other two. This difference can be explained by a fundamental difference in the way the errors are provided to the models. While the point-wise signal-noise association is preserved in the input of CNN1D stack-σ and CNN2D images, the same does not happen for CNN1D with-σ, where that natural association is lost. The model may eventually recover such an association, but discarding it from the start only adds unnecessary difficulties to the model. Therefore, in what follows we opted to discard the CNN1D with-σ model.
The typical values of σ_0 for which we see a significant difference between the models that do and do not include the uncertainties depend on the number of features n. If there is an abundance of points characterizing the curve, the classification becomes easier, and σ_0 must be larger in order to cause some confusion between the classes. In other words, it is equivalent to either lower the overall level of noise (σ_0) or to increase the number of features (n). This is shown in Fig. 5, where we evaluate the performance of the models as we increase the number n of features, for a fixed σ_0 = 3.2. As we grow the number of features, all methods become more efficient, but the CNN1D method without errors clearly underperforms the other methods. As the number of features grows, we are, once again, able to use less regularized architectures because, as we increase the overall signal, noise becomes less important and the networks become less sensitive to overfitting. Again, this is in agreement with the case where we vary σ_0.

Varying ∆g
Figure 6 shows the performance of the classifiers as the noise dispersion parameter ∆g changes. This plot shows that, as we increase the relative difference between the noise levels of the data points, that information becomes more critical to the classification. That information is naturally used in the least squares method, and is also learned by the CNNs that are provided with the error bars, but the CNN that is only given the input data is unable to harness it to improve the classification.

Probability Output
We now address the question of the quality of the ML classifiers as compared with an MCMC approach. For each object we explore the likelihood function, Eq. (6). We use flat priors for the parameters, which were free to vary inside the ranges a ∈ [−3, 3], b ∈ [−5, 5] and c ∈ [−5, 25] -- i.e., more than 8 standard deviations for the parameter a, and about three standard deviations away from the means for the parameters b and c.
The goal of the MCMC is to compute the posterior probability that the maximum likelihood classification for each individual object is correct. In order to compute that probability, we count the fraction of points in the chains that are assigned the correct class of each object. Since each object is also assigned a class by the three different ML methods, together with their respective confidences, we can plot the ML confidence as a function of the MCMC probability for all objects in the test sample.
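The chain-counting procedure can be sketched as follows, assuming (for illustration) that the class is determined by the sign of the curvature parameter a; the Metropolis sampler, its step sizes, and the function name are our own simplifications, not the paper's exact setup.

```python
import numpy as np

def mcmc_class_probability(y, sigma, true_sign, n_steps=20000, rng=None):
    """Metropolis sampling of (a, b, c) for one object; returns the
    fraction of chain samples whose curvature sign matches the true class."""
    rng = rng or np.random.default_rng(1)
    x = np.linspace(-1, 1, len(y))

    def log_like(theta):
        a, b, c = theta
        model = a * x**2 + b * x + c
        # heteroscedastic Gaussian log-likelihood (flat priors assumed)
        return -0.5 * np.sum(((y - model) / sigma) ** 2)

    theta = np.array([0.0, 0.0, np.mean(y)])
    logp = log_like(theta)
    hits = 0
    for _ in range(n_steps):
        prop = theta + rng.normal(0.0, [0.1, 0.1, 0.1])
        lp = log_like(prop)
        if np.log(rng.uniform()) < lp - logp:   # Metropolis acceptance
            theta, logp = prop, lp
        hits += (np.sign(theta[0]) == true_sign)
    return hits / n_steps
```

In practice one would discard a burn-in phase and tune the proposal widths; the fraction returned here plays the role of the MCMC probability plotted in Fig. 7.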
This comparison is shown in Figures 7 and 8. In Fig. 7 we show the CNN confidence and the MCMC probability for 5 × 10^4 objects in the smiley (positive) class, using the baseline dataset. Objects with MCMC probability > 0.5 are classified in the correct class, and when the probability is ≤ 0.5 the maximum likelihood classification fails. The same applies to the CNNs: a confidence of 0.5 marks the threshold between correct and incorrect classification.
The most revealing aspect of Fig. 7 is that, when the noise levels of all input data points are nearly the same (nearly homoscedastic case, ∆g = 0.1, top row), the three ML methods hold a tight correlation between the confidence of the classification and the MCMC probability that the classification is correct. However, when the data points have significantly different levels of noise (heteroscedastic case, ∆g = 0.5, bottom row), if those uncertainties are not passed on to the CNN (as is the case of the CNN1D no-σ model, left panel), then there is basically no correlation between the ML confidence and the MCMC probability. But when the input data uncertainties are part of the information provided to the CNNs, a clear correlation appears between the ML confidence and the probability. In fact, the ML confidence is well fitted by a sigmoid function of the MCMC probability; this function can be easily inverted, which allows us to map confidences back onto probabilities. The goodness of this fit can be quantified with the mean squared error (MSE) metric, MSE = (1/m test) Σ_j [c(j) − f(p(j))]^2, where m test is the number of objects in the test sample, c(j) is the ML confidence for object j, f is the fitted sigmoid, and p(j) is the probability (according to the MCMC) that the classification of object j is correct. We obtain that, for ∆g = 0.1, the MSE for all methods is below 0.01, with the CNN2D images method performing slightly better at 0.0026, compared with 0.0059 for CNN1D no-σ and 0.0052 for the CNN1D stack-σ model (we use a sample of m test = 5 × 10^5 objects to compute this statistic). When we increase the noise dispersion to ∆g = 0.5, the CNN1D no-σ model fails to fit the sigmoid, with an MSE of 0.0386, whereas the CNN1D stack-σ and CNN2D images methods still perform well, below 0.01 (see Table 3).
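A minimal sketch of the sigmoid fit and its MSE, assuming a one-parameter sigmoid centered at p = 0.5 (the paper's exact functional form is not reproduced here, and `fit_confidence` is an illustrative name):

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(p, k):
    # one-parameter sigmoid centered at p = 0.5 (assumed functional form)
    return 1.0 / (1.0 + np.exp(-k * (p - 0.5)))

def fit_confidence(prob, conf):
    """Fit the ML confidence as a sigmoid of the MCMC probability;
    return the slope k and the MSE of the fit."""
    (k,), _ = curve_fit(sigmoid, prob, conf, p0=[5.0])
    mse = np.mean((conf - sigmoid(prob, k)) ** 2)
    return k, mse
```

A low MSE indicates that the classifier's confidence is an invertible, monotonic proxy for the MCMC probability; a high MSE (as for CNN1D no-σ at ∆g = 0.5) indicates that the correlation has broken down.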
Table 3: MSE of the cases shown in Fig. 7.
           CNN1D no-σ   CNN1D stack-σ   CNN2D images
∆g = 0.1   0.0059       0.0052          0.0026
∆g = 0.5   0.0386       0.0087          0.0064

The results summarized in Figures 7 and 8, together with Table 3, mean that, when the noise levels of all the input data are the same (or nearly the same), the ML methods are able to estimate the quality of the classification (the confidence) in a way that works as a proxy for the probability that this classification is correct. On the other hand, if different data points have varying levels of noise, then it becomes essential to pass that information on to the networks. When that information is hidden from the network, as happens for CNN1D no-σ, the method loses its ability to provide a confidence that is significantly correlated with the probability for that classification (see the bottom left panel of Fig. 7). However, if we pass the noise properties of the data as information to the CNNs, then we allow those networks to reconstruct confidences that are tightly correlated with the probabilities for the classification.
Another way to visualize the predictive power provided by the noise information is to take the points shown in Fig. 7 and separate them into objects that are correctly classified by the MCMC (probability > 0.5) and those that are incorrectly classified. In Fig. 8 we show the resulting distribution of objects as a function of the ML confidence for the incorrectly classified (left panel) and correctly classified (right panel) objects. We also show how that confidence varies with the size of the training sets (m = 1, 2 and 5 × 10^5 objects). It is immediately clear that the CNN that is blind to the information about uncertainties is unable to pick out the noisier objects, and as a result it assigns high confidences to objects in the wrong class much more often than the other methods. Furthermore, for the objects that are correctly classified by the MCMC (right panel), the CNN1D no-σ method tends to assign lower probabilities to more objects, which again is a result of that method being unable to weigh features by their uncertainties (notice that the log-scale plot makes it somewhat difficult to see that the number of correctly classified objects with the highest confidence is significantly lower for the CNN1D no-σ method).

Varying the training set size
An important issue that appears as we increase the dimensionality of the system by including additional parameters related to noise is the size of the training set that is needed for the network to converge. We have evaluated the performance of all classifiers as a function of the size of the training set for our baseline model, with n = 20, σ 0 = 0.5, ḡ = 0.6 and ∆g = 0.5. We froze the same architecture that was used with m = 2 × 10^5 objects in the training set, and re-trained the network with different sizes in order to see how its performance degrades or improves as we decrease or increase the number of objects. We also analyse how sensitive the model becomes to the initial seed as the training set size decreases. For smaller numbers of objects, it is harder for the model to converge; this is not the case for larger training sets, where the model converges to very similar results with different initial seeds. Of course, it is still possible to improve the accuracy for larger/smaller m if one reduces/increases the complexity of the network.
The left panel of Fig. 9 shows the accuracy for both train (dashed lines) and test (solid lines) sets of the models, as we increase the number of training instances. This means that, as we take m → ∞, we minimize the epistemic error of each ML method, and all that remains is the impact of aleatoric errors on the different models. The lines correspond to the median value of multiple realizations, and the error bars (shaded regions) correspond to the "normalized median absolute deviation" σ NMAD, defined by σ NMAD = 1.48 × median(|x − median(x)|). This measure of the width of a PDF reduces to the standard deviation in the case of a mono-variate Gaussian distribution, but is less affected by the tails of the distribution. We see that, for the training set sizes that we analyzed, no amount of training data is sufficient for CNN1D no-σ to come close to the performance of the models that include the information about uncertainties. Notice that our toy model allows us to generate an arbitrarily large number of objects, while data augmentation usually employs some fixed set of objects and then adds an artificial amount of noise to the input data of those objects. Therefore, the larger training sets of Fig. 9 are composed of objects which behave exactly like the ones in the validation and test sets, while data-augmented training sets are made up of a mix of original objects as well as objects which behave in a fundamentally different way compared with the original ones. As a result, the performance of a model trained on a set of m original objects is always superior to that of a model trained on a set of m objects that was created with the help of data augmentation techniques. Hence, this result also shows that data augmentation techniques cannot overcome the deficit of neglecting the information about uncertainties. Both CNN2D images and CNN1D stack-σ, even if the former presents higher complexity than the latter, have similar performances, especially when considering the intrinsic model variations (shaded regions in Fig. 9).
Moreover, the amount of data required for the models to converge to some accuracy depends on how noisy the data are: when the data are noisier, the models are more likely to display overfitting. This is shown in the right panel of Fig. 9, where we plot the accuracy of the models as a function of the training set size, for different values of σ 0 (we only show the results for CNN1D stack-σ, for clarity, but all methods behave in essentially the same way). Notice that in this plot we normalized the accuracy of the CNN classification by the accuracy of the least squares (maximum likelihood) classification, in order to highlight the fact that the difference between those accuracies is more pronounced for larger values of the nominal uncertainty σ 0.
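The σ NMAD statistic used for the error bands above can be computed as follows, assuming the standard definition of the normalized median absolute deviation (the 1.4826 constant is the usual Gaussian normalization):

```python
import numpy as np

def sigma_nmad(x):
    """Normalized median absolute deviation: 1.4826 * MAD.
    For Gaussian data this matches the standard deviation, but it
    is far less sensitive to outliers in the tails."""
    x = np.asarray(x)
    return 1.4826 * np.median(np.abs(x - np.median(x)))
```

Applied to the accuracies of many independent training realizations, this gives a robust width for the shaded regions in Fig. 9.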

Poisson Noise
The Poisson distribution is particularly interesting because the uncertainty of a measurement can be deduced from the measurement itself. In that sense, it would be redundant to provide the noise information: if a feature has a measured value x i, a good estimator for the noise is already given by √x i. Hence, we expect that the information about the uncertainty is already encoded in the value of the measurement, and therefore it makes no difference whether we use as input only the measurement, with CNN1D no-σ, or the measurement together with its associated error bar, as we do with CNN1D stack-σ or CNN2D images. In order to work with Poisson noise, it is convenient to employ measurements with values close to one, in order to reinforce the skewness of the distribution. Therefore, we used an adapted version of the smiley-frowny model to study the Poisson case. We lowered the mean value of the parameter c of the parabolic curves to get mean values closer to one, and also reduced its standard deviation to avoid measurements with negative values. The few objects which happened to display any feature with negative values were discarded. Finally, since the noise in this case turns out to be relatively small, we also chose different values of c for smiley and frowny objects, so that we could shuffle the curves more efficiently and prevent the machine from distinguishing between the two classes by their absolute values instead of their curvature. Table 4 presents the distribution of parameters used to test the Poisson noise model.
In the Poisson case we do not have a parameter such as σ 0, as we had in the Gaussian noise case. However, we can vary the level of noise by increasing or decreasing the number of features, n. Fig. 11 shows the dependence of the performance on n for the Poisson case (compare it with Fig. 5, for the Gaussian noise model). These results show that the accuracy improves as n increases, but all models have basically the same accuracy. This means that including the uncertainty or not is irrelevant when the noise is drawn from a Poisson distribution: indeed, in that case there is no additional information in the noise levels (error bars) that is not already present in the measurements (the input data).
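The Poisson version of the toy model can be sketched as below. The specific parameter values and the class-dependent offsets in c are illustrative assumptions; the key point, made in the text, is that the error bar √x i is derived from the measurement itself and so carries no extra information:

```python
import numpy as np

def make_poisson_objects(m, n=20, rng=None):
    """Poisson version of the smiley-frowny model (a sketch).
    The error bar of each feature is estimated from the measurement
    itself as sqrt(x_i), so providing it separately is redundant."""
    rng = rng or np.random.default_rng(2)
    x = np.linspace(-1, 1, n)
    labels = rng.integers(0, 2, size=m)
    a = np.abs(rng.normal(1.0, 0.2, size=m)) * np.where(labels == 1, 1, -1)
    # class-dependent vertical offset, as described in the text
    c = np.where(labels == 1, 3.0, 5.0)
    mean = a[:, None] * x**2 + c[:, None]
    mean = np.clip(mean, 0.1, None)             # Poisson rates must be positive
    counts = rng.poisson(mean)
    sigma_est = np.sqrt(np.maximum(counts, 1))  # error bar from the data itself
    return counts, sigma_est, labels
```

Feeding (counts, sigma_est) or counts alone to the classifiers should then yield essentially the same accuracy, which is what Fig. 11 shows.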

The waveform dataset
We now apply our methods to the publicly available waveform data set (Breiman et al., 1984). That data set consists of three different combinations of three functions:
class 1: x i = u h(1) i + (1 − u) h(2) i + δx i,
class 2: x i = u h(1) i + (1 − u) h(3) i + δx i,
class 3: x i = u h(2) i + (1 − u) h(3) i + δx i,
where i = 0, 1, . . ., 20 labels the n = 21 features, and the three "parent" waves are shown in Fig. 12 (from left to right, h(1), h(3) and h(2)). The random variable u is drawn from a uniform distribution, u ∈ [0, 1].
There are two versions of the waveform dataset available in the UCI repository, which provides a data folder with a data generator code written in C and also a document describing the data. The parent waves are nonzero only for i ∈ [1, 20]: for i = 0 and i = 21, the value of the wave is zero for all three combinations (classes) (see Figure 12), which means that x 0,21 = δx 0,21. Therefore, explicitly adding δx i to the data set is redundant.
The task is to classify objects according to these three types, using the n = 21 measurements and uncertainties. In the original versions, the noises δx i of the waveform data set were all generated from the same normal distribution, N(µ = 0, σ = 1). The data set actually included the noise itself, δx i, instead of the standard deviation of the Gaussian distribution from which those δx i were sampled. Since the noise is sampled from the same PDF for all points, there is no value in informing the networks about that noise. For this reason, we adapted the code that generates those features in order to have noises with different variances. Just as in the original waveform data set, we draw the uncertainties δx i from normal distributions, δx i ∼ N(µ = 0, σ i = g i σ 0), but in our modified version the noise levels g i are sampled from a uniform distribution with ḡ = 0.8 and ∆g = 0.5, i.e., g i ∈ [0.3, 1.3].
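A sketch of the modified generator follows. The triangular parent waves, centered at i = 11, 15 and 7, follow the usual description of the Breiman et al. dataset, but the exact definitions (and our function name) should be checked against the UCI generator code rather than taken as authoritative:

```python
import numpy as np

def make_waveforms(m, g_bar=0.8, delta_g=0.5, sigma0=1.0, rng=None):
    """Waveform objects with heteroscedastic noise (a sketch of the
    modified dataset; parent-wave centers are assumptions)."""
    rng = rng or np.random.default_rng(3)
    i = np.arange(21)
    h1 = np.maximum(6 - np.abs(i - 11), 0)   # triangular parent waves
    h2 = np.maximum(6 - np.abs(i - 15), 0)
    h3 = np.maximum(6 - np.abs(i - 7), 0)
    pairs = [(h1, h2), (h1, h3), (h2, h3)]   # classes 1, 2, 3
    labels = rng.integers(0, 3, size=m)
    u = rng.uniform(0.0, 1.0, size=m)
    base = np.stack([u[j] * pairs[k][0] + (1 - u[j]) * pairs[k][1]
                     for j, k in enumerate(labels)])
    # per-feature noise levels g_i ~ U[g_bar - delta_g, g_bar + delta_g]
    g = rng.uniform(g_bar - delta_g, g_bar + delta_g, size=(m, 21))
    sigma = g * sigma0
    return base + rng.normal(0.0, sigma), sigma, labels
```

Setting delta_g = 0 and sigma0 = 1/g_bar recovers (up to the parent-wave details) the homoscedastic noise of the original dataset.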
We have computed the performance of the three ML classifiers, CNN1D no-σ, CNN1D stack-σ and CNN2D images, as well as the classification using the maximum likelihood, in the following situations: first, we kept ∆g = 0.5 fixed and varied the nominal error parameter σ 0; second, we fixed σ 0 = 1.25 and varied the noise dispersion parameter ∆g.
The results are shown in Fig. 13. They closely resemble those shown in Fig. 4 and in Fig. 6, and show that the CNNs that take into account the different levels of noise are superior to the method that discards that information, regardless of whether the data is overall less or more noisy (lower or higher values of σ 0). The right panel of Fig. 13 shows that, as ∆g grows and the information about which data points are more or less noisy becomes increasingly relevant, the methods that can account for these differences in signal-to-noise outperform by far the CNN1D no-σ method, which is blind to those distinctions. Notice, in particular, that in the limit ∆g → 0 we recover the original waveform version, where the noise has a Gaussian distribution with N(µ = 0, σ = ḡ σ 0 = 1).
The main difference with respect to the smiley-frowny model is that now the maximum likelihood classification can be worse than the CNNs, since the latter are able to pick out the distinguishing global features of the parent waveforms. This is because the maximum likelihood classification is derived from a point-wise fit, such that the input data points contribute to the fit independently of each other. The ML methods, on the other hand, are able to relate and combine different data points in order to detect the shape of the object, resulting in a more robust classification.
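The point-wise weighted fit that underlies the maximum likelihood classification can be sketched, for the smiley-frowny case, as a weighted least-squares parabola fit whose curvature sign gives the class (the implementation details here are our own, not the paper's code):

```python
import numpy as np

def ml_classify(y, sigma):
    """Weighted least-squares fit of y = a x^2 + b x + c; the sign of
    the fitted curvature a gives the class (1 = smiley, 0 = frowny)."""
    x = np.linspace(-1, 1, len(y))
    # weight each row of the design matrix by 1/sigma_i, so noisier
    # points contribute less to the chi-square
    A = np.vstack([x**2, x, np.ones_like(x)]).T / sigma[:, None]
    coef, *_ = np.linalg.lstsq(A, y / sigma, rcond=None)
    return 1 if coef[0] > 0 else 0
```

Because each residual enters the chi-square independently, this fit cannot exploit correlated, non-local patterns the way a CNN can, which is exactly the limitation described above.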

Discussion and Conclusion
In this paper we address the value of the information about noise in input data for Machine Learning (ML) methods. We have shown that, when a data set includes not only the (noisy) measurements of the features, but also the information about the underlying distribution functions that generated that noise, CNNs are able to learn about the context of that noise, improving the performance of classification tasks and reaching "optimality", defined here in terms of a maximum likelihood approach.
In order to prove this statement we created a toy model for two classes (the "smiley" and "frowny" parabolic curves), and a model for input data noise that realizes the typical process of measurement. Each object was generated from parameters that obey a random process, allowing us to build arbitrarily large sets that we can use to train, validate and test our methods. Noise, on the other hand, was also generated by means of a random process, but in such a way that each data point (feature) has a noise that is drawn from a different PDF, whose dispersion is known: this is the "error bar" associated with each feature. This is exactly what takes place in a laboratory: the experimenter not only takes the measurement, but also assesses the uncertainties of each measurement, which are typically not all identical.
As a result, not only do the objects in our two classes have known underlying distributions, but each object can also be classified using a maximum likelihood approach. This creates a standard against which the ML methods can be compared, as well as the concept of an "optimal" accuracy for the classifiers. Notice that optimality, defined in this sense, has to be used with great care: ML methods can outperform maximum likelihood estimators when there are non-local patterns in the data that can serve to distinguish the objects. However, in our toy model we precluded any such patterns from appearing, since we limited the distinction between the two classes to a single parameter (the curvature). It is in that sense that we can define optimality.
Our main result is that, when the information about data noise is passed on to a CNN, it can learn how to use the different levels of noise to weigh the input data. This leads to improvements in the performance of the classifiers, in such a way that the accuracy of the ML classification approaches that of the optimal (maximum likelihood) estimator. In fact, the more the noise levels vary from point to point (as controlled by the noise dispersion parameter ∆g), the better the performance of the CNNs that included the noise level information compared with the CNN that did not.
Moreover, we showed that, when the levels of input data noise are not all identical, the confidence of the ML method that is ignorant about those specific noise levels becomes uncorrelated with the underlying cumulative distribution function (see Fig. 7). However, when the noise levels are provided as additional data inputs to the CNNs, the resulting confidence of the classifiers can again be mapped onto the MCMC probability for the classification. Although that mapping is noisy, Fig. 7 shows that the CNNs seem to be using the information about the different levels of noise in the input data to reconstruct what is, in effect, a proxy for the likelihood function.
We have further tested CNNs with and without the noise level information using a slightly modified version of the waveform data set (Breiman et al., 1984). Just as happened for the smiley-frowny model, including the information about the different noise levels improves the accuracy of the classification, by an amount that becomes larger as we increase the noise dispersion parameter ∆g. We also computed the classification of objects in that data set using a maximum likelihood approach; however, in that case the global patterns of the objects (the two peaks at known positions) can be detected by the CNNs, hence in some instances the maximum likelihood method was inferior to the CNNs. Nevertheless, when the information about noise levels is included in the CNNs, they always outperform the maximum likelihood classification (see Fig. 13).
We also checked that the noise levels are only relevant when they provide information that is not already included in the data set itself. In order to show this, we created a modified version of the smiley-frowny model whose features are numbers drawn from a Poisson distribution. In that case, for each feature x i the noise levels are well approximated by √x i, and therefore there is very little additional information being provided by adding those errors to the data set. And indeed, what we find is that in this case the CNNs with and without error information have basically identical accuracies (after accounting, of course, for the different levels of complexity of the models).
It is important to stress that ignoring the information about the different levels of noise in input data degrades the quality of ML classifiers in a way that cannot be offset by adding objects to the training set, through, e.g., the use of data augmentation techniques. As can be seen in Fig. 9, increasing the size of the training set, even in an ideal setup such as the one provided by our toy model, is not sufficient to allow the CNN without error information to achieve the accuracy level of the CNNs that include that information. In other words, the noise information is essential, and cannot be substituted or compensated. Moreover, as discussed in Section 3, regularization techniques are also insufficient to compensate for the lack of information about the different levels of input data noise.
Finally, we ought to remark that our conclusions were drawn in the context of a model inspired by scientific data, where tracking signal and noise is commonplace. However, in all areas of data science the issue of measurement is a key one: some data sets are more robust than others, and some data points are more reliable than others. Furthermore, estimations about the levels of noise in input data are often available to the data analyst not only in the hard sciences, but also, e.g., in economic data or in the social sciences. What we have shown here is that providing these noise levels to ML methods adds significant information to the algorithms, improving the performance of classification or regression tasks in a way that cannot be compensated by techniques such as data augmentation or regularization.
Python codes and documentation can be found at https://github.com/nvillanova/smiley_frowny .
The choice of these hyperparameters might have a relevant impact on the performance of the model in some cases, because these parameters define how much of the information in the data is being communicated through this representation. To illustrate this, we show in Figure 14 a smiley curve with n = 20 and σ 0 = 0.2, represented in three images built with different hyperparameters. We see that these parameters control the "resolution" of the curve, i.e., the number of pixels used to bin x i. More specifically, the width of the bins expressing the intervals for the values x ρ i is given by:
∆x = (up bound + low bound) / n rows . (29)
Figure 15: Top: from left to right, input matrices of CNN2D images with threshold = 0.001, 0.4, 0.8 and 1.0 (see Eq. (30)); Bottom: accuracy as a function of threshold.
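A minimal sketch of the binning that converts the n measurements into a single-channel image for the CNN2D model (the function name, default bounds and resolution are illustrative assumptions; the repository above contains the actual implementation):

```python
import numpy as np

def to_image(y, n_rows=32, low_bound=None, up_bound=None):
    """Convert n measurements into an (n_rows x n) single-channel image
    by binning each value x_i into a pixel row (a sketch of the CNN2D
    input representation)."""
    y = np.asarray(y, dtype=float)
    low = np.min(y) if low_bound is None else low_bound
    up = np.max(y) if up_bound is None else up_bound
    img = np.zeros((n_rows, len(y)))
    # map each value to a row index; n_rows controls the "resolution"
    rows = np.clip(((y - low) / (up - low) * (n_rows - 1)).astype(int),
                   0, n_rows - 1)
    img[n_rows - 1 - rows, np.arange(len(y))] = 1.0   # row 0 at the top
    return img
```

Increasing n_rows narrows the bin width and sharpens the curve in the image, at the cost of a larger input; this is the trade-off illustrated in Figure 14.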

Figure 1 :
Figure 1: Parabolic curves generated according to the Smiley-Frowny model. The two lines on the left (green and dark green) are "smiley" objects, and the lines on the right (blue and dark blue) are "frowny" objects. The vertical shifts (parameter c) of the curves shown were fixed for visualization purposes. The noisy measurements of the features of those objects are shown as data points with error bars. Each data point has a different, but known, PDF whose properties are summarized by the variances expressed in the error bars.

Figure 2 :
Figure 2: Mean probability that smiley objects are correctly classified. From top to bottom, the curves correspond to the parameters σ 0 = 0.25 (blue line), 0.5 (orange) and 1.0 (green), respectively, and we used n = 20 features.

Figure 3 :
Figure 3: Input data illustration. On the right we present the input formats for the three CNN1D networks, where x i are the measurements and σ i = g i σ 0 are the uncertainties of each measurement. On the left we show how to convert the input data (center) to a matrix, or image, for the CNN2D network (notice that this is an image with a single channel, where the color gradient indicates the values of each pixel). The black rectangles represent the convolution kernels.

Figure 5 :
Figure 5: Accuracy as a function of the number of features n for fixed ḡ = 0.6, ∆g = 0.5 and σ 0 = 3.2.

Figure 8 :
Figure 8: Confidence of the CNNs in the classification of the smiley objects which were incorrectly (left) and correctly (right) classified by MCMC, in the case where g ∈ [0.1, 1.1]. The model parameters are the same as the ones used for Fig. 7, but here we also vary the size of the training set. Solid lines correspond to training sets with m = 5 × 10^5 instances, dashed lines m = 2 × 10^5 and dotted lines m = 1 × 10^5. The curves are the mean values over multiple training realizations.

Figure 9 :
Figure 9: Left: Accuracies as a function of the training set size m. Solid and dashed lines correspond to test and train sets, respectively. The curves are the median value of multiple realizations and the shaded region covers the interval median ± σ NMAD (Eq. 26); Right: Ratio between the accuracy of CNN1D stack-σ and the accuracy of least squares as a function of the number of training set instances for σ 0 = 0.2, 0.5, 0.8.
Fig. 10 shows the input data for CNN2D images with n = 20, which represent the measurements.
Table 4: Parameters of the normal distributions from which the coefficients of the parabolic curves are sampled (Poisson version).

Figure 11 :
Figure 11: Accuracy as a function of the number of features n in the Poisson version of the smiley-frowny model.

Figure 12 :
Figure 12: h functions of the waveform dataset.

Figure 13 :
Figure 13: Results of the classifications applied to the modified waveform data set. Left: fixed ∆g = 0.5, varying σ 0; Right: fixed σ 0 = 1.25, varying ∆g. For both cases we take ḡ = 0.8, thus σ̄ 0 = 0.8 σ 0. The standard waveform data set has σ̄ 0 = 1 and ∆g = 0, which corresponds to extending the curves on the right panel to ∆g → 0. For comparison, the star symbol in the right panel shows the best accuracy (0.8702) obtained in the waveform classification competition (where ∆g = 0).