
Uncertainties in Parameters Estimated with Neural Networks: Application to Strong Gravitational Lensing

Laurence Perreault Levasseur, Yashar D. Hezaveh, and Risa H. Wechsler

Published 2017 November 15 © 2017. The American Astronomical Society. All rights reserved.
Citation: Laurence Perreault Levasseur et al. 2017 ApJL 850 L7. DOI: 10.3847/2041-8213/aa9704


Abstract

In Hezaveh et al. we showed that deep learning can be used for model parameter estimation and trained convolutional neural networks to determine the parameters of strong gravitational-lensing systems. Here we demonstrate a method for obtaining the uncertainties of these parameters. We review the framework of variational inference to obtain approximate posteriors of Bayesian neural networks and apply it to a network trained to estimate the parameters of the Singular Isothermal Ellipsoid plus external shear and the total flux magnification. We show that the method can capture the uncertainties due to different levels of noise in the input data, as well as training and architecture-related errors made by the network. To evaluate the accuracy of the resulting uncertainties, we calculate the coverage probabilities of marginalized distributions for each lensing parameter. By tuning a single variational parameter, the dropout rate, we obtain coverage probabilities approximately equal to the confidence levels for which they were calculated, resulting in accurate and precise uncertainty estimates. Our results suggest that the application of approximate Bayesian neural networks to astrophysical modeling problems can be a fast alternative to Markov Chain Monte Carlo methods, allowing orders of magnitude improvement in speed.


1. Introduction

The use of neural networks for performing complex tasks has seen a rapid expansion in recent years. These networks have exceeded human performance in many experiments, including competing against a Go champion (Silver et al. 2016), playing Atari games (Mnih et al. 2015), and outperforming practicing dermatologists in the visual diagnosis of skin cancer (Esteva et al. 2017).

Neural networks are computational structures that can identify underlying relationships in new input data by learning from previously seen examples. These networks process their inputs by a series of multiplications with their weights and the application of nonlinear functions to the resulting values in processing units known as neurons. This process is repeated consecutively in multiple structures known as layers, with every layer being composed of multiple neurons and the output of each layer being the input of the subsequent one. In convolutional neural networks, the weights are organized in sets of filters, and the values of the neurons are the result of the convolution of the inputs with these filters. The values of the network weights are determined through a procedure known as training, where pairs of input–output examples, the training set, are presented to the networks and the values of the network weights are optimized by minimizing a cost function to reduce the deviation between the networks' predictions and the true values of the target outputs.
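As a schematic illustration of this process (a toy sketch, not the architecture used in this work), a single fully connected layer can be written in a few lines; the function name and the choice of nonlinearity below are ours:

```python
import numpy as np

def dense_layer(x, W, b, activation=np.tanh):
    """One layer of a neural network: multiply the input by the layer's
    weights, add a bias, and apply a nonlinearity. A network chains many
    such layers, each layer's output serving as the next layer's input."""
    return activation(W @ x + b)
```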

Commonly, neural networks consist of weights with fixed, deterministic values, resulting in deterministic outputs. If, instead, the weights of a network are allowed to span a range of values given by a probability distribution, the problem can be defined in a Bayesian framework (Denker & LeCun 1991). The resulting Bayesian neural networks can capture the posterior probabilities of the weights, yielding well-defined estimates of uncertainties. Inferring the model posterior with these networks, however, is a difficult task; different approximations have been introduced to facilitate its computation.

Recently, neural networks were applied to the discovery of lenses (Jacobs et al. 2017; Lanusse et al. 2017; Petrillo et al. 2017) and to the simulation of weakly lensed galaxy images (Ravanbakhsh et al. 2016). In Hezaveh et al. (2017) we showed that neural networks can be used to estimate the parameters of strong lenses. If these networks are to become the primary tool for the analysis of future data, the uncertainties associated with their predictions must be quantified. Here, we apply variational inference in order to obtain these uncertainties. We briefly summarize the statistical framework developed by Gal & Ghahramani (2015, 2016) and Kendall & Gal (2017), and apply it to the problem of estimating the parameters of strong lensing systems. We assess the accuracy of this method by calculating the coverage probabilities of the posteriors and find that by tuning a single variational parameter we can obtain accurate uncertainties.

In Section 2 we review the general framework for obtaining model uncertainties. In Section 3 we discuss the application of this method to strong lensing and examine the accuracy of the uncertainties. We discuss the results and conclude in Section 4.

2. Obtaining Model Uncertainties in Neural Networks

There are two sources of errors that contribute to uncertainties in the values of parameters estimated with neural networks. The first, aleatoric uncertainty, arises from inherent corruptions to the input data, e.g., detector noise and point-spread function blurring. The second type of uncertainty, epistemic uncertainty, stems from the networks' error in predicting the parameters of interest, e.g., due to insufficient training data. Epistemic uncertainties are generally network dependent: more flexible networks or more training can reduce them, while aleatoric uncertainties are limited by the quality of the input images. Recent works have demonstrated how to obtain approximate uncertainties in computationally efficient ways (Gal & Ghahramani 2015, 2016; Kendall & Gal 2017). Here we review the principles of obtaining model uncertainties with variational inference.

2.1. Epistemic Uncertainties in Neural Networks

Bayesian neural networks offer a probabilistic framework to predict values of interest in classification and regression tasks. Instead of having deterministic values, the weights of these networks are specified by probability distributions. This is achieved by placing a prior over the network weights. Given a network with weights ${\boldsymbol{\omega }}$ and a training data set with input images ${\boldsymbol{X}}=\{{{\boldsymbol{x}}}_{1},\,...,{{\boldsymbol{x}}}_{N}\}$ and the corresponding output parameters ${\boldsymbol{Y}}=\{{{\boldsymbol{y}}}_{1},\,...,{{\boldsymbol{y}}}_{N}\}$, the posterior of the network weights, $p({\boldsymbol{\omega }}| {\boldsymbol{X}},{\boldsymbol{Y}})$, captures the plausible network parameters. With this posterior, we can calculate the probability distribution of the values of an output ${\boldsymbol{y}}$ for a new test input point ${\boldsymbol{x}}$ by marginalizing over all possible weights ${\boldsymbol{\omega }}$:

Equation (1)

$$p({\boldsymbol{y}}\,|\,{\boldsymbol{x}},{\boldsymbol{X}},{\boldsymbol{Y}})=\int p({\boldsymbol{y}}\,|\,{\boldsymbol{x}},{\boldsymbol{\omega}})\,p({\boldsymbol{\omega}}\,|\,{\boldsymbol{X}},{\boldsymbol{Y}})\,d{\boldsymbol{\omega}}.$$

Although simple to formulate, in practice performing inference with these networks is a difficult task. Typically, the posterior $p({\boldsymbol{\omega }}| {\boldsymbol{X}},{\boldsymbol{Y}})$ cannot be evaluated analytically. Different approximations have been introduced in order to calculate this distribution, with variational inference (Jordan et al. 1999) being the most popular. In variational inference, $p({\boldsymbol{\omega }}| {\boldsymbol{X}},{\boldsymbol{Y}})$ is replaced by an approximating variational distribution, $q({\boldsymbol{\omega }})$, with an analytic form. The parameters defining this distribution are then optimized such that $q({\boldsymbol{\omega }})$ is as close as possible to the true posterior. This is performed by minimizing their Kullback–Leibler (KL) divergence, a measure of the dissimilarity between two distributions. Equation (1) can then be written as

Equation (2)

$$p({\boldsymbol{y}}\,|\,{\boldsymbol{x}},{\boldsymbol{X}},{\boldsymbol{Y}})\approx \int p({\boldsymbol{y}}\,|\,{\boldsymbol{x}},{\boldsymbol{\omega}})\,q({\boldsymbol{\omega}})\,d{\boldsymbol{\omega}}.$$

It has been shown that minimizing the KL divergence is equivalent to maximizing the log-evidence lower bound,

Equation (3)

$${\mathcal{L}}_{\mathrm{VI}}=\int q({\boldsymbol{\omega}})\,\log p({\boldsymbol{Y}}\,|\,{\boldsymbol{X}},{\boldsymbol{\omega}})\,d{\boldsymbol{\omega}}-\mathrm{KL}\left(q({\boldsymbol{\omega}})\,\|\,p({\boldsymbol{\omega}})\right),$$

with respect to the variational parameters defining $q({\boldsymbol{\omega }})$ (Gal & Ghahramani 2015, 2016).

The form of this variational distribution is an arbitrary choice. One possible form is to define $q({\boldsymbol{\omega }})$ for the ith layer of the neural network such that

Equation (4)

$${{\boldsymbol{W}}}_{i}={{\boldsymbol{M}}}_{i}\,\mathrm{diag}\left([{z}_{i,j}{]}_{j=1}^{{J}_{i-1}}\right),\qquad {z}_{i,j}\sim \mathrm{Bernoulli}({p}_{i}),$$

where $[{z}_{i,j}{]}_{j=1}^{{J}_{i-1}}$ is a vector of length ${J}_{i-1}$ containing the Bernoulli-distributed random variables for units $j=1,\,\ldots ,\,{J}_{i-1}$ in layer $i-1$ with probabilities ${p}_{i}$, and ${{\boldsymbol{M}}}_{i}$ is the ${J}_{i}\times {J}_{i-1}$ matrix of the variational parameters to be optimized (Gal & Ghahramani 2016). The integral in Equation (3) can be numerically approximated with a Monte Carlo integration. Sampling from $q({{\boldsymbol{\omega }}}_{i})$ is now equivalent to performing dropout on layer i in a network whose weights are ${{\boldsymbol{M}}}_{i}$. Dropout (Srivastava et al. 2014) is a technique that was introduced in order to prevent networks from overfitting. For each forward pass, individual nodes are dropped out, i.e., set to zero, with probability p, known as the dropout rate.
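As an illustration, the sampling step implied by Equation (4) can be sketched in numpy; the function name and the explicit keep-probability convention below are our own choices:

```python
import numpy as np

def sample_dropout_weights(M, keep_prob, rng=np.random):
    """Draw one weight realization W_i = M_i * diag(z), z_j ~ Bernoulli(keep_prob).

    M is the (J_i x J_{i-1}) matrix of variational parameters for layer i.
    Zeroing entire columns of M is equivalent to dropping out the
    corresponding units of layer i-1.
    """
    z = rng.binomial(1, keep_prob, size=M.shape[1])  # one mask entry per input unit
    return M * z[np.newaxis, :]
```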

The first term in Equation (3) is the log-likelihood of the output parameters for the training set. As shown in Gal & Ghahramani (2015), the second term, the KL term, can be approximated as an L2 regularization. We can then write this as

Equation (5)

$${\mathcal{L}}_{\mathrm{VI}}\approx \frac{1}{N}\sum _{n=1}^{N}{\mathcal{L}}\left({{\boldsymbol{y}}}_{n},{\hat{{\boldsymbol{y}}}}_{n}({{\boldsymbol{x}}}_{n},{\boldsymbol{\omega}})\right)-\lambda \sum _{i}\parallel {{\boldsymbol{M}}}_{i}{\parallel }^{2},$$

where ${ \mathcal L }({{\boldsymbol{y}}}_{n},{\hat{{\boldsymbol{y}}}}_{n}({{\boldsymbol{x}}}_{n},{\boldsymbol{\omega }}))$ is the log-likelihood of the network's prediction ${\hat{{\boldsymbol{y}}}}_{n}({{\boldsymbol{x}}}_{n},{\boldsymbol{\omega }})$ for training input ${{\boldsymbol{x}}}_{n}$ with true values ${{\boldsymbol{y}}}_{n}$, $\lambda$ is the strength of the regularization term, and ${\boldsymbol{\omega }}$ are sampled from $q({\boldsymbol{\omega }})$. Once the network is trained, performing inference can be done by approximating Equation (2) with a Monte Carlo integral by predicting the output values multiple times using dropout, a procedure known as Monte Carlo dropout.

In short, in order to obtain a network's epistemic uncertainties we can simply train it with dropout before every weight layer and optimize a cost function given by the log-likelihood with an L2 regularization term. At test time, each realization of the network's outputs—given by a forward pass with a random dropout—is a sample from the approximate parameter posterior. Obtaining epistemic uncertainties is then done by feeding a given input example multiple times to the network and collecting the outputs.
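A minimal sketch of this test-time procedure, assuming `stochastic_forward` is any network function that applies a fresh dropout mask on every call (the name is hypothetical):

```python
import numpy as np

def mc_dropout_samples(stochastic_forward, x, n_samples=1000):
    """Feed the same input through the network repeatedly with dropout
    active; each pass yields one sample from the approximate posterior.

    Returns an array of shape (n_samples, n_outputs).
    """
    return np.stack([stochastic_forward(x) for _ in range(n_samples)])
```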

2.2. Aleatoric Uncertainties

For regression tasks, the log-likelihood in Equation (5) can be written as a Gaussian log-likelihood of the form

Equation (6)

$${\mathcal{L}}\left({\boldsymbol{y}},\hat{{\boldsymbol{y}}}\right)=-\frac{1}{2}\sum _{k}\left[\frac{{({y}_{k}-{\hat{y}}_{k})}^{2}}{{\sigma }_{k}^{2}}+\log {\sigma }_{k}^{2}\right],$$

where ${\sigma }_{k}$, the observation noise parameter, represents the uncertainties in the $k$th parameter arising from inherent corruptions to the input or output data.

Our training data are made up of noisy images, yet their corresponding outputs consist of the true (uncorrupted) parameters. Because of varying levels of noise in the input data, however, the predictions of the network at test time have different levels of uncertainty. It is therefore sensible to use a heteroscedastic model, i.e., a model that can capture a different level of uncertainty for each of its output predictions. This can be achieved by training networks to predict their uncertainties. In practice, a single network is trained to predict both the parameters of interest and their associated ${\sigma }_{k}$, so that the final layer has twice as many neurons as the number of parameters.

Although we train the networks to predict their uncertainties, the values of ${\sigma }_{k}$ are not included in the training data. Instead, they are learned from optimizing the log-likelihood, i.e., the cost function. The second term in Equation (6) ensures that large values of ${\sigma }_{k}$ are penalized, while the first term discriminates against small values.
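For concreteness, a sketch of the (negative) per-parameter log-likelihood of Equation (6), with constants dropped; the function name and the averaging over parameters are our assumptions:

```python
import numpy as np

def heteroscedastic_nll(y_true, y_pred, sigma):
    """Negative Gaussian log-likelihood with a per-parameter noise scale.

    The first term grows when sigma is too small for the residuals,
    while the second term penalizes sigma values that are too large.
    """
    return np.mean(0.5 * (y_true - y_pred) ** 2 / sigma ** 2
                   + 0.5 * np.log(sigma ** 2))
```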

2.3. Combining Aleatoric and Epistemic Uncertainties

In order to obtain the total uncertainty of a network in its predictions, we combine its aleatoric and epistemic uncertainties. We first perform Monte Carlo dropout by feeding an input image multiple times to the network, each time performing dropout and collecting the outputs. This provides samples from the predictive distribution of the network, capturing the epistemic uncertainties. Each prediction in this sample also has its associated aleatoric uncertainty, represented by ${\sigma }_{k}$. To add these uncertainties we draw a random number from a normal distribution with a variance of ${\sigma }_{k}^{2}$ for each sample and add it to the predicted value. We use a normal distribution as we have adopted a Gaussian likelihood for optimizing the network.
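The combination step described above amounts to a one-line perturbation of the dropout samples; a sketch, assuming the samples and their predicted noise scales are stored as arrays of shape (n_samples, n_parameters):

```python
import numpy as np

def combine_uncertainties(mc_params, mc_sigmas, rng=np.random):
    """Add aleatoric scatter to each Monte Carlo dropout sample.

    mc_params : dropout samples of the predicted parameters.
    mc_sigmas : the corresponding predicted observation-noise scales sigma_k.
    Each sample is perturbed by Gaussian noise of variance sigma_k**2.
    """
    return mc_params + rng.normal(size=mc_params.shape) * mc_sigmas
```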

3. Application to Strong Gravitational Lensing

We trained AlexNet (Krizhevsky et al. 2012) to predict the parameters of the Singular Isothermal Ellipsoid with external shear in addition to the total flux magnification. The model is parameterized with its Einstein radius, ${\theta }_{{\rm{E}}}$, Cartesian components of complex ellipticity, ${\epsilon }_{x}$ and ${\epsilon }_{y}$, coordinates of the center of the lens, x and y, Cartesian components of complex shear, ${\gamma }_{x}$ and ${\gamma }_{y}$, and the total lensing flux magnification, ${\mu }_{{\rm{F}}}$. We use dropout layers before every weight layer, including convolutional layers (Gal & Ghahramani 2016). The final layer contains sixteen neurons, with the first half predicting the lensing parameters and the second half the observation noise scalars, ${\sigma }_{k}$. Instead of directly predicting ${\sigma }_{k}$, we predict the log-variance, ${s}_{k}=\mathrm{log}{\sigma }_{k}^{2}$, resulting in improved numerical stability and avoiding potential division by zero (Kendall & Gal 2017). The cost function to minimize for the nth example in the training set is the negative log-likelihood plus the L2 regularization term, written as

Equation (7)

$$-{\mathcal{L}}_{n}={\left\langle \frac{1}{2}\,{e}^{-{s}_{k}}{\left({y}_{n,k}-{\hat{y}}_{n,k}\right)}^{2}+\frac{1}{2}\,{s}_{k}\right\rangle }_{k}+\lambda \sum _{i}\parallel {{\boldsymbol{M}}}_{i}{\parallel }^{2},$$

where the average ${\langle \cdot \rangle }_{k}$ is taken over all of the output parameters $k$. The value of the regularization strength, λ, is determined using a validation data set.
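A sketch of this per-example cost, written in terms of the predicted log-variances ${s}_{k}$; the argument names and the exact constant factors are our assumptions:

```python
import numpy as np

def example_cost(y_true, y_pred, s_pred, weight_matrices, lam=1e-4):
    """Negative log-likelihood expressed with s_k = log(sigma_k**2),
    averaged over the output parameters, plus an L2 penalty of strength
    lam on the weight matrices."""
    nll = np.mean(0.5 * np.exp(-s_pred) * (y_true - y_pred) ** 2 + 0.5 * s_pred)
    l2 = lam * sum(np.sum(w ** 2) for w in weight_matrices)
    return nll + l2
```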

Our training, validation, and test sets consist of simulated images and are described in Hezaveh et al. (2017). Here, we have also added external shear to the simulations, with a maximum shear amplitude of 0.3. We have also made the network predict the total flux magnification (the ratio of the observed to the intrinsic source flux). For numerical stability, we divide the flux magnification by a factor of 16 in order to allow all of the parameters to span a similar numerical range. We train the network with dropout keep rates (one minus the dropout rate) of $80 \% $, $90 \% $, $97 \% $, $99 \% $, and 100%. The network weights are initialized at random and trained with stochastic gradient descent, and the regularization strength is set to $\lambda =0.0001$.
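A minimal sketch of this preprocessing and of the hyperparameters quoted above; the variable names, the helper function, and the assumption that ${\mu }_{{\rm{F}}}$ is stored in the last target column are ours:

```python
import numpy as np

KEEP_RATES = [0.80, 0.90, 0.97, 0.99, 1.00]  # dropout keep rates used for training
LAMBDA_L2 = 1e-4                             # regularization strength lambda
FLUX_SCALE = 16.0                            # mu_F is divided by this factor

def scale_targets(params, flux_index=-1):
    """Rescale the flux magnification so all target parameters span a
    similar numerical range (assumes mu_F sits in column `flux_index`)."""
    scaled = np.array(params, dtype=float)
    scaled[..., flux_index] = scaled[..., flux_index] / FLUX_SCALE
    return scaled
```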

We test the performance of the network on 1000 simulated examples. For each example, we feed the input to the network 2000 times, effectively drawing $2000$ samples from the approximate posterior. Each sample contains eight lensing parameters in addition to their associated aleatoric uncertainties, ${s}_{k}=\mathrm{log}{\sigma }_{k}^{2}$. For each parameter of each example, we then draw a random number from a normal distribution with variance ${\sigma }_{k}^{2}$ and add it to the associated predicted parameter. The resulting sample of parameters now includes both the aleatoric and epistemic uncertainties. Figure 1 shows the estimated flux magnification against the true value of this parameter for 200 test examples. The error bars show the $68.3 \% $ confidence intervals. The orange, blue, and black error bars correspond to examples where the true values fall within the $68.3 \% $, $95.5 \% $, and $99.7 \% $ confidence intervals, respectively.


Figure 1. Predicted 68.3% uncertainties for lensing flux magnification, ${\mu }_{{\rm{F}}}$, as a function of the true value of this parameter. The orange, blue, and black error bars correspond to examples where the true values fall within the $68.3 \% $, $95.5 \% $, and $99.7 \% $ confidence intervals, respectively.


3.1. Tests on the Accuracy of the Combined Uncertainties

In order to evaluate the accuracy of the obtained uncertainties, we calculate their coverage probabilities, defined as the fraction of the test examples where the true value lies within a particular confidence interval. We calculate these for the $68.3 \% $, $95.5 \% $, and $99.7 \% $ confidence levels, corresponding to the $1\sigma $, $2\sigma $, and $3\sigma $ confidence levels of a normal distribution. For each input, we define the $68.3 \% $ confidence interval as the region containing $68.3 \% $ of the most probable values of the integrated probability distribution. We then calculate the fraction of test examples for which this interval contains the true values of the parameters. An accurate, unbiased interval estimator should yield a coverage probability equal to the confidence level of the interval for which it was calculated.
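One way to compute these quantities from posterior samples is sketched below; the function names are ours, and the shortest-interval construction is one possible reading of the interval definition above:

```python
import numpy as np

def hpd_interval(samples, level=0.683):
    """Shortest interval containing a fraction `level` of the 1D samples
    (a simple highest-density-interval estimate)."""
    x = np.sort(samples)
    n_in = int(np.ceil(level * len(x)))
    widths = x[n_in - 1:] - x[:len(x) - n_in + 1]
    start = np.argmin(widths)
    return x[start], x[start + n_in - 1]

def coverage_probability(samples_per_example, true_values, level=0.683):
    """Fraction of test examples whose true parameter value falls inside
    the `level` interval built from that example's posterior samples."""
    hits = 0
    for samples, truth in zip(samples_per_example, true_values):
        lo, hi = hpd_interval(samples, level)
        hits += (lo <= truth <= hi)
    return hits / len(true_values)
```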

Figure 2 shows the resulting coverage probabilities, averaged over all parameters, for networks trained with different dropout keep rates. We notice that with lower keep rates, the networks overestimate their errors, resulting in conservative estimates, while with keep rates of 99% and 100% the estimates are mildly permissive. For the network trained with a keep rate of 97%, the resulting coverage probabilities are very close to their corresponding confidence levels, resulting in accurate uncertainties. This suggests that the dropout rate should be regarded as a variational parameter of the model. These results could be improved for individual parameters by using concrete dropout (Gal et al. 2017), allowing the tuning of the dropout rate for different parameters.


Figure 2. Coverage probabilities, averaged over all of the parameters, for networks trained with different dropout keep rates. From dark to pale blue, the shades correspond to coverage probabilities calculated for the $68.3 \% $, $95.5 \% $, and $99.7 \% $ confidence intervals. The horizontal red dashed lines show the ideal values of the coverage probabilities for these confidence intervals (equal to the confidence levels). Networks trained with a lower keep rate overestimate their errors, while high keep rates (99% and 100%) result in mildly permissive uncertainties. For the network trained with a keep rate of 97%, the resulting coverage probabilities are very close to their corresponding confidence levels, resulting in accurate uncertainties.


The results of the coverage probabilities for individual parameters for this network are summarized in Table 1. Each column shows the coverage probability for a different lensing parameter. The test data contain varying levels of random Gaussian noise, uniformly distributed to result in maximum per-pixel signal-to-noise ratios (S/N) between 10 and 100. The bottom row shows the median standard deviation of the resulting parameter uncertainties, in effect a measure of the precision of the estimated parameters. We find that these coverage probabilities are sufficiently close to the ideal values to allow the uncertainties to be used for most practical purposes (e.g., Sonnenfeld et al. 2015).

Table 1.  Coverage Probabilities for Individual Parameters for the Network Trained with 97% Keep Probability

Confidence Level ${\theta }_{{\rm{E}}}$ ${\epsilon }_{x}$ ${\epsilon }_{y}$ x y ${\gamma }_{x}$ ${\gamma }_{y}$ ${\mu }_{{\rm{F}}}$
68.3% 74.4 62.1 63.4 73.8 72.5 61.9 62.2 66.4
95.5% 95.1 93.8 94.2 97.0 97.2 93.4 93.2 94.7
99.7% 99.5 99.2 99.2 99.0 99.5 99.2 98.6 99.5
Median Precision 0.02 0.04 0.04 0.02 0.02 0.02 0.02 0.42

Note. The columns show the coverage probabilities for the Einstein radius, ${\theta }_{{\rm{E}}}$, the x- and y-components of complex ellipticity, ${\epsilon }_{x}$ and ${\epsilon }_{y}$, the coordinates of the center of the lens, x and y, the x- and y-components of complex shear, ${\gamma }_{x}$ and ${\gamma }_{y}$, and the total lensing flux magnification, ${\mu }_{{\rm{F}}}$. The bottom row shows the median standard deviation of the resulting parameter uncertainties, a measure of the precision of the estimated parameters.


We also calculated the coverage probabilities for batches of test data with fixed noise levels in order to examine whether the network uncertainties were able to adapt to different levels of noise in the input data. For example, we found coverage probabilities of $74.7 \% $, $96.8 \% $, and $99.5 \% $ for the Einstein radius for batches with a maximum S/N of 100 per pixel, while these values are $71.0 \% $, $95.9 \% $, and $99.4 \% $ for data with an S/N of 10. Figure 3 shows the average standard deviation of the resulting uncertainties, i.e., their precision, as a function of the amplitude of noise in the input data. The curves correspond to different parameters of the model. The network was only trained with data with a noise rms less than 0.1 (the shaded part of the figure). The intercept on the left side indicates the precision of the network for samples with no noise. As expected, noisier data results in larger uncertainties, including for noise levels higher than those in the training set.


Figure 3. Standard deviation of the estimated uncertainties averaged over the test sample as a function of the amplitude of noise. The curves correspond to the eight output parameters. The network was trained with data containing noise with an rms less than 0.1 (shaded region). Noisier data results in larger uncertainties, even for levels higher than those in the training data.


Figure 4 shows five representative examples of the test images with different levels of uncertainties in the predicted parameters. The uncertainties, averaged over all of the parameters, are marked in each panel. As expected, lensing configurations close to Einstein rings (leftmost panel), which provide more stringent constraints for matching multiple images, result in more precise estimates, while configurations with only a pair of compact images (rightmost panel) result in large uncertainties.


Figure 4. Visual inspection of five test images with increasing uncertainties in their obtained parameters. As expected, lensing configurations with multiple opposing images and close to Einstein rings result in more precise estimates, while configurations similar to panel 5, with only one pair of compact images, have large uncertainties. All of the images contain similar noise levels. The rms of the parameter uncertainties (averaged over all of the parameters) is given in each panel.


4. Discussion and Conclusion

The results of Table 1 demonstrate that neural networks can produce accurate interval estimates of lensing parameters, with a precision comparable to that obtained with traditional lens-modeling methods (Hezaveh et al. 2017).

Although networks can be made to predict their aleatoric uncertainties, the low coverage probabilities for the network trained without dropout (Figure 2) indicate that including the epistemic contributions can improve the uncertainty estimates. The form of the variational distribution is an arbitrary choice. Bernoulli distributions, however, make the Monte Carlo evaluation of the integrals in Equations (2) and (3) equivalent to performing dropout, a widely implemented feature of most neural network libraries. This allows for the trivial implementation of approximate Bayesian neural networks using existing tools. Training with dropout incurs no additional computational cost. At test time, drawing a few hundred samples from the posterior can be done in a few seconds on a single graphics processing unit, offering more than seven orders of magnitude improvement in speed compared to traditional modeling methods (e.g., Nightingale et al. 2017).

Table 1 shows that even when averaged coverage probabilities are equal to their corresponding confidence levels, these probabilities for individual parameters may slightly deviate from their ideal values. In future work we plan to use concrete dropout (Gal et al. 2017), which allows for the tuning of the dropout rate for each individual prediction, resulting in more accurate marginalized uncertainties for individual parameters.

Although we chose a Gaussian form for the aleatoric uncertainties, the total probability distributions could be highly non-Gaussian, due to the contribution of the epistemic uncertainties. The samples drawn using Monte Carlo dropout reflect the posterior of the parameters, influenced by the true degeneracies in the models, the distributions of the parameters in the training data, and the error of the networks. We interpret the uncertainties resulting from modeling the σ matrix to be the marginalized distributions for the output parameters. If the joint distributions of the parameters are desired, it should be possible to also predict the off-diagonal elements. We defer this study to future work.

Neural networks allow for the fast estimation of parameters from input data. Here we showed that they can also produce accurate estimates of the uncertainties of lensing parameters. This makes them a suitable tool for the analysis of large samples of data or for the analysis of complex models, where exploring the model parameter space with maximum likelihood methods could be slow and intractable. Given the large volumes of data expected from upcoming surveys, they can play a crucial role in astrophysical data analysis.

We thank Phil Marshall, Gil Holder, and Roger Blandford for comments on the manuscript. We thank Stanford Research Computing Center for providing computational resources. Support for this work was provided by NASA through Hubble Fellowship grant HST-HF2-51358.001-A awarded by the Space Telescope Science Institute, operated by the Association of Universities for Research in Astronomy, Inc., for NASA, under contract NAS 5-26555.
