Hierarchical Inference of the Lensing Convergence from Photometric Catalogs with Bayesian Graph Neural Networks

We present a Bayesian graph neural network (BGNN) that can estimate the weak lensing convergence (κ) from photometric measurements of galaxies along a given line of sight (LOS). The method is of particular interest in strong gravitational time-delay cosmography (TDC), where characterizing the “external convergence” (κ ext) from the lens environment and LOS is necessary for precise Hubble constant (H 0) inference. Starting from a large-scale simulation with a κ resolution of ∼1′, we introduce fluctuations on galaxy–galaxy lensing scales of ∼1″ and extract random sight lines to train our BGNN. We then evaluate the model on test sets with varying degrees of overlap with the training distribution. For each test set of 1000 sight lines, the BGNN infers the individual κ posteriors, which we combine in a hierarchical Bayesian model to yield constraints on the hyperparameters governing the population. For a test field well sampled by the training set, the BGNN recovers the population mean of κ precisely and without bias (within the 2σ credible interval), resulting in a contribution to the H 0 error budget well under 1%. In the tails of the training set with sparse samples, the BGNN, which can ingest all available information about each sight line, extracts a stronger κ signal compared to a simplified version of the traditional method based on matching galaxy number counts, which is limited by sample variance. Our hierarchical inference pipeline using BGNNs promises to improve the κ ext characterization for precision TDC. The code is available as a public Python package, Node to Joy ⏬.

1. INTRODUCTION Galaxies in the sky may appear to us distorted in shape and magnified (or de-magnified) in brightness.The responsible phenomenon is gravitational lensing: massive structures along our line of sight curve spacetime, thereby gravitationally bending the path of light as it travels to us from the background source.When measured from a large sample of galaxies, the lensing distortions probe the total matter distribution in the Universe, both luminous and dark.A quantity of particular importance in cosmology and galaxy evolution studies is the convergence (κ), defined as the total integrated mass along a line of sight weighted by the lensing efficiency, which peaks for mass located roughly midway between us and the source.
Reconstructing a reliable map of the density field κ across the sky, or mass mapping, can inform the relationship between luminous matter and dark matter, known as the galaxy-halo connection.Mass maps have been jointly analyzed with maps of stellar mass, galaxies, and galaxy clusters to constrain the galaxy bias-the statistical relation between the distribution of galaxies and matter (e.g.Chang et al. 2016).See Wechsler & Tinker (2018) for a review.
Mass maps can also improve weak lensing cosmology when used alongside shear maps.Traditional analyses using two-point correlations of the directly measurable shear field are limited to modeling zero-mean Gaussian random fields in the matter density.Extending to higherorder statistics can yield much more information about structure formation, including any non-Gaussianities in the early Universe that would propagate at large scales into the late Universe, as well as those that may arise at small scales later due to nonlinear gravitational collapse (Jeffrey et al. 2018).
For reconstructing wide-field mass maps, variants of the direct shear-to-convergence inversion algorithm by Kaiser & Squires (1993) ("KS") are commonly used (Van Waerbeke et al. 2013;Vikram et al. 2015;Chang et al. 2015;Liu et al. 2015;Chang et al. 2018;Oguri et al. 2018).The KS method does not account for missing data or noise, so there have also been efforts to bring in different priors about the convergence field (e.g.Wiener 1949;Leonard et al. 2014;Lanusse et al. 2016).A related line of work deals with magnification rather than convergence (e.g.Gunnarsson et al. 2006;Ménard et al. 2010;Morrison et al. 2012).
Most sightlines in the Universe are only weakly lensed.But in the rare event that a halo lines up in front of a source, the lensing is strong enough that the source appears copied into multiple images.The photons from each image arrive at our telescope with relative time delays.If the strongly lensed source is time-variable, e.g. a quasar or a supernova, the time delays can be measured and fed into a cosmographic analysis to infer the Hubble constant (H 0 ) (Refsdal 1964).Typically in time delay cosmography, the external lensing perturbations are approximated as a quadrupole lens parameterized by external convergence (κ ext ) and shear.The external shear can be constrained by modeling the lenses from imaging data, but κ ext cannot-a manifestation of the mass-sheet degeneracy (Falco et al. 1985).We therefore require external tracers of the density field in addition to the strong lens modeling.Suyu et al. (2010) and Fassnacht et al. (2011) pioneered the method of estimating κ ext based on galaxy number counts.In brief, the method involves (1) counting the galaxies around a lens system, (2) comparing the resulting counts against those of a control field, e.g. the Cosmic Evolution Survey (COSMOS) field (Scoville et al. 2007) and the Canada-France Hawaii Telescope Lensing (CFHTLenS) Survey (Heymans et al. 2012), to obtain relative counts, and (3) selecting lines of sight of similar relative counts, along with their associated convergence values, from a numerical N-body simulation with an assumed galaxy-halo connection.The matched κ samples can then be interpreted as κ ext posterior samples for a sightline, where the data for the sightline has been reduced to number counts as a summary statistic.This method of matching summary statistics rather than evaluating a specific likelihood function can be cast as approximate Bayesian computation (ABC) (Birrer et al. 2019).Note that the galaxy-halo connection encoded in the the N-body simulation and the semi-analytic galaxy evolution model is a key source of uncertainty in the κ constraints.
Since then, the method has evolved to accommodate summary statistics beyond simple galaxy counts.Suyu et al. (2013) used the external shear inferred from lens modeling as an additional constraint.Greene et al. (2013) explored schemes to weight number counts by the galaxy redshift, stellar mass, and projected separation from the line of sight.The additional information improved the κ ext constraints for the most overdense lines of sight but helped little for the less dense ones.They found that the underdense sightlines yielded the best κ ext constraints, at a residual uncertainty of σ κext 0.03, corresponding to an uncertainty on H 0 comparable to that from lens modeling.Alternatively, Collett et al. (2013) reconstructed κ ext using a galaxy halo model, calibrating for the effect of dark structures and voids.Similarly to Greene et al. (2013), they also reported the most precise estimates for environments with low κ ext .McCully et al. (2014) devised the "flexion shift" metric quantifying the contribution of a line-of-sight object to the lens potential.For the ground-truth κ ext , all of these studies used the 1.7 deg 2 mock sky maps of κ ext obtained by raytracing through the Millennium Simulation (Springel et al. 2005;Hilbert et al. 2009).The analysis methods shed light on the spectroscopic data previously collected around strong lenses for line-of-sight assessment (e.g.Fassnacht et al. 2002Fassnacht et al. , 2006;;Momcheva et al. 2006).
The H0LiCOW/COSMOGRAIL/STRIDES/SHARP collaborations (hereafter TDCOSMO5 collaboration) currently constrain κ ext for each lens independently when inferring H 0 from the combined lens sample.With this standard procedure, TDCOSMO achieved ∼2% precision on H 0 from the seven lenses (Rusu et al. 2020;Wong et al. 2019;Chen et al. 2019;Shajib et al. 2019;Millon et al. 2020).The uncertainty increased to ∼5% when the lens mass models were made significantly more flexible (Birrer et al. 2020).In relaxing assumptions on the lens mass model, Birrer et al. (2020) marginalized over the population offset in the mass sheet that is internal to the main deflector, which manifests as a transform of its mass profile.While these analyses did not hierarchically account for the external aspect of the mass sheet due to line-of-sight structure, the interpretation of the mass sheet can be extended to include external effects.
Hierarchically inferring the population mean of κ ext and marginalizing over it is important, because any bias in the assumed population mean of κ ext directly biases H 0 .Note that the mean κ ext may not necessarily vanish for an ensemble of lenses due to selection effects, e.g.lens galaxies tend to lie in groups (Blandford et al. 2000), causing a slight preference for lens systems with overdense lines of sight (Fassnacht et al. 2011;Collett & Cunnington 2016).Any discrepancy between our prior and the actual κ ext population can result in bias, so we must hierarchically infer the actual population statistics and account for it in TDC analysis.The Legacy Survey of Space and time (LSST) at the Vera Rubin Observatory is predicted to discover tens of thousands of lenses (Collett & Cunnington 2016), among them ∼8,000 lensed quasars (Oguri & Marshall 2010).As we scale up to such a large sample, hierarchical inference of κ ext can combine the information from all sightlines, each with limited signal on κ ext , to minimize systematic bias on H 0 .
Although uncertainties in other components of time delay cosmography, such as lens modeling and time delay measurements, have reduced over time, κ ext remains a significant source of uncertainty on H 0 .The primary drawback of matching relative number counts to estimate κ ext is that the wealth of photometric observations for each galaxy-such as the measured magnitudes in each bandpass filter, positions, and shapes-become reduced to a few numbers for the whole sightline.We seek to extract information more efficiently by enlisting a neural network to process all available observations.Namely, we design a Bayesian graph neural network (BGNN) that can take in the measurements of a variable number of galaxies observed around a given line of sight and output the full posterior probability density function (PDF) over κ.Training the BGNN on a galaxy catalog with associated κ labels allows it to implicitly learn the distribution of dark matter in galaxies and clusters encoded in the underlying simulation.This precludes the need to manually correct for the mass that is not in galaxy or cluster halos or assume overly simple models for the lensing mass, as done in Collett et al. (2013).
This paper seeks to improve the κ ext estimation on three fronts: (1) proposing the BGNN as a novel κ ext estimation engine, (2) combining the BGNN constraints into hierarchical inference of the population κ ext , and (3) enhancing the lensing scales of large-scale simulations to adequately describe lensing effects on the scale of galaxyscale strong lenses.We ask the following questions: • Is the Bayesian graph neural network (BGNN) capable of accurately and precisely estimating κ for individual sightlines?
• When propagated into a hierarchical inference framework, do the BGNN-inferred posteriors lend themselves to precise and accurate recovery of hyperparameters governing κ in the population?This investigation seeks to quantify population-level selection effects in regimes where insufficient information is available on an individual sightline-bysightline basis.
• How does the BGNN compare with the ABC technique of matching number counts?
• Which environments lend themselves to better κ constraints?At what extremes does this method break?
To answer these questions, we train our BGNN on a realistic galaxy catalog with associated κ labels and design a suite of numerical experiments to validate our approach.
We organize this paper as follows.Section 2 provides an introduction to time delay cosmography, with a focus on the impact of κ ext on H 0 inference.Section 3 details our hierarchical inference pipeline.Section 4 reports our κ recovery results on 1,000 test sightlines.Section 5 situates our findings within the context of time delay cosmography and suggests next steps.
We reserve a bulk of the details about our simulations to the Appendix (Section 6).There, we start with a primer on the lensing formalism and describe how we use multi-plane raytracing to post-process an existing galaxy catalog and κ map, produced in the weak lensing resolution, for strong lensing applications.
Throughout, we will present comparisons with the method of matching summary statistics, in order to highlight the difference between a neural net and an ABCbased algorithm.Our implementation of the summary statistics matching does not represent the TDCOSMO implementation.TDCOSMO uses orders of magnitude more sightlines, allowing them to match simultaneously to two sets of summary statistics, whereas we match to one summary statistic at a time.Additionally, TD-COSMO incorporates external shear information from the lens modeling to further constrain their κ ext .This is not possible within our context, because we only operate on sightlines without strong lensing.
We release with this publication the open-source Dark Energy Science Collaboration (DESC) Python package Node to Joy 6 .This package implements the com-6 http://github.com/jiwoncpark/node-to-joyplete pipeline, including the structure-enhanced raytracing, training and evaluation of the BGNN, comparison with summary statistics matching, and the hierarchical inference analysis.

STRONG LENSING TIME DELAY
COSMOGRAPHY Let us begin by reviewing the basic principles of time delay cosmography (Refsdal 1964).Readers are referred to recent reviews, e.g.Treu & Marshall (2016), for more details.When light rays from a distant source are deflected by some foreground lens, the travel time from the source to the observer depends on both their path lengths and the gravitational potential they traverse.Assuming a single thin lens, the excess time delay of an image at position θ originating from a source at position β relative to an unperturbed path is Here, is the Fermat potential (Schneider 1985;Blandford & Narayan 1986), defined for the lensing potential ψ(θ), a two-dimensional projection of the three-dimensional Newtonian potential Φ(ηθ, η): D ∆t is the time delay distance: where z d is the redshift of the deflector galaxy and D d , D s , D ds are the angular diameter distances to the deflector, to the source, and between the deflector and the source, respectively.We have that D ∆t ∝ 1/H 0 .Additionally, structures along the line of sight (LOS) to the strong lens introduce additional weak lensing perturbations.The resulting convergence affects the time delays while keeping imaging observables the same under linear transformations of the lens equation-the so-called mass-sheet degeneracy (MSD).As shown by Falco et al. (1985), remapping the reference mass distribution κ, for any scalar λ, as while isotropically scaling the source plane as β → λβ results in the same dimensionless observables (image pixel values) but different time delays.The infinite family of solutions to the lens equation results in a range of inferred properties of the deflector and the source.
The mass sheet parameter λ in Equation 5 may be internal to the main deflector, affecting its kinematics, or it may stem from the external line-of-sight structure.It is common practice to express the external portion λ ext as an equivalent additional mass sheet at the redshift of the main deflector with uniform surface mass density, called external convergence and denoted κ ext (Schneider 1997;Keeton et al. 1997).In terms of λ ext , we have If we were to use a model not accounting for κ ext (i.e.fixing κ ext =0) to infer the time delay distance D (0) ∆t , then the true time-delay distance can be recovered using a separately constrained κ ext as: In terms of H 0 , this relation is: Underestimating κ ext thus leads to an upward bias on H 0 and vice versa.
The external convergence κ ext is a product of three different convergence values (Birrer et al. 2020;Fleury et al. 2021): where κ s , κ d , κ ds correspond to the integrated convergence along the strong lens line of sight from the observer to the source, from the observer to the strong lens, and from the strong lens to the source, respectively.These component convergence values transform the background angular diameter distances (D BG ) from the homogeneous background metric given by the cosmological model, without any perturbations, into the angular diameter distances along the specific line of sight of the strong lens (D SL ).That is, We apply our method only to κ s , but it can be generalized to accommodate all three angular diameter distances.
Because each sightline contains limited information about κ ext , the inference of κ ext necessitates a hierarchical approach.For a sample of N lens lenses, we can apply the simplified and analytical error propagation in Birrer et al. (2021) to evaluate the contribution of κ ext to the overall H 0 error budget.We decompose the H 0 uncertainty into two terms: one term capturing global shifts in the inferred H 0 due to the mass sheet λ and another term incorporating uncertainties introduced by individual lenses, including uncertainties from the time delay measurements and the inferred Fermat potentials of main deflectors.The fractional uncertainty on H 0 can be written as , (10) where λ refers to the population mean in λ and σ(x) to the 1σ uncertainty on the quantity x.
Note that λ of Equation 10 includes both the internal and external mass sheets.Other probes that are sensitive to both can also constrain λ but would not allow such a separation in the physical interpretation.The contribution of the external portion λext to term 1 of Equation 10is This represents the main contribution of κ ext to the overall H 0 error budget.There is a smaller, second-order contribution that enters term 2 in Equation 10: where σ κext denotes the 1σ scatter in the population κ ext .This contribution scales inversely with N lens and becomes subdominant to other sources of uncertainty, especially in the large-N lens setting.Suppose we estimate κext for the population mean in κ ext but the truth lies offset from this, at κext,true .It follows from Equation 7that the fractional bias on H 0 introduced by an erroneous estimate κext is Frac.bias on That is, the error on κext biases H 0 almost directly.
In this paper, we perform hierarchical inference to obtain the full posterior PDF over the population κ ext statistics, κext and σ κext (see Section 3.4).Equation 11and, to a lesser extent, Equation 12 allow us to contextualize our hierarchical constraints within the overall H 0 error budget.

METHODS
In this section, we outline the methodology for the various stages of our κ inference pipeline.Figure 1 illustrates our pipeline as a flowchart and probabilistic graphical models (PGMs) for training and inference.In Section 3.1, we describe the simulated training data.In particular, we state the numerical choices made when postprocessing the existing galaxy catalog and a map of weak lensing κ to generate the training set suitable for strong lensing applications.Then, in Section 3.2, we explain how the BGNN extracts the κ posteriors from individual sightlines.Alternatively, the traditional summary statistics method can be used to obtain the posteriors, as described in Section 3.3.Lastly, the κ posteriors, whether from the BGNN or the summary statistics matching, enter the hierarchical inference framework to yield constraints on hyperparameters governing the population; Section 3.4 describes this process.

Simulated data
Our training set consists of a galaxy catalog queried around random sightlines and the associated κ labels.Both the input catalog and target labels are derived from CosmoDC2, a catalog based on a cosmological N-body simulation and a semi-analytic galaxy evolution model (Korytov et al. 2019a).The catalog contains the κ WL computed in the weak lensing (WL) regime (δ ∼ 1 ).We perform additional raytracing on top of the Cos-moDC2 halos to introduce fluctuations on a finer angular (δ ∼ 1 ) resolution required for strong lensing studies.Section 6.4.1 details this structure-enhanced raytracing procedure.The surface mass densities were estimated by rendering halos with the Navarro-Frenk-White (NFW) mass profile (Navarro et al. 1997), described in Section 6.4.2.Section 7 briefly summarizes the semianalytic models used to generate the galaxy catalogs.We train our BGNN on the photometric information associated with CosmoDC2 sightlines and the κ values that we postprocessed to the strong lensing resolution.The source redshift was fixed at z src = 2, typical of lenses expected to be discoverable in upcoming wide-area surveys (Collett 2015).This was a choice also made in (Li et al. 2020).
The goal of this paper is to validate our hierarchical κ inference pipeline within a toy testbed, by explicitly treating the training distribution as a prior and conducting retrieval tests using carefully chosen test sets.We choose a broad Gaussian distribution for the training κ to enable an analytical treatment of the BGNN prior.As we will see in Section 3.4, Gaussian priors are especially convenient in our hierarchical inference context, because it has support everywhere in R. We subsampled 200,000 sightlines from the phenomenological set of 600,000 Cos-moDC2 sightlines to follow κ ∼ N (0.01, 0.04) where N (µ, σ) is a normal distribution with mean µ and standard deviation σ.The validation set contains 1,000 examples, drawn from the same distribution as the training set.
Our primary target label for each sightline is κ.We additionally design per-galaxy labels, the redshift (z) and stellar mass (M ) of each galaxy included in the Cos-moDC2 catalog.We are motivated by the idea that training the BGNN to simultaneously predict z, M of each galaxy as well as κ may improve the inductive bias of the network for κ inference.We anticipate that it is important for the network to understand galaxy redshifts well, so that it can predict the relative κ contribution of the associated halos to the overall κ for the sightline, i.e. the lensing efficiency.The network must also internally learn to exclude a galaxy/halo from consideration if it is behind the source redshfit of z src = 2.We included stellar masses as a per-galaxy label to serve as proxies for halo masses, which were only available for central galaxies (populating main halos) in the CosmoDC2 catalog and not for satellite galaxies (populating subhalos).
Our input is the photometric catalog of galaxies queried within an aperture of radius 1 around a sightline.The input features of each galaxy were derived from the CosmoDC2 sky position and magnitudes in ugrizY.We compute the relative sky position of each galaxy from the central line of sight located at RA LOS , dec LOS and transform the distances to the flat sky.
Gaussian photometric errors corresponding to 5-year LSST conditions were simulated and added to the true ugrizY magnitudes.We adopted the analytic magnitude-dependent model for the 1-σ uncertainties described in Ivezić et al. (2019).The model includes dependencies on true magnitudes as well as band-dependent parameters like sky brightness, seeing, atmospheric extinction, instrumental noise, and the number of visits.Figure 9 shows that our error model agrees with the estimated photometric uncertainties in the DC2 catalog (LSST DESC et al. 2021), which simulates 5 years of the planned 10-year LSST survey.In addition, we applied a detection cut of i < 25.3, corresponding to the 10-year i-band gold sample (Gorecki et al. 2014).
In summary, the input to our model was a variablesized set of galaxies located within 1 of a sightline.Each galaxy carried eight features: the relative positional offset from the central line of sight in RA and dec and the ugrizY magnitudes with photometric noise added.

Bayesian graph neural network
Bayesian neural networks (BNNs) are probabilistic extensions to standard neural networks (Denker & LeCun 1991).As outlined in Jospin et al. (2020) 20) also generally increase with increasing κ.The source was located at z = 2 for all the sightlines.Best viewed in color.
fines the prior on the network weights and the predictive distribution on the target quantities, and a functional model, which defines the network architecture.We describe the stochastic model in Section 3.2.1 and the functional model in Section 3.2.3.

Posterior inference with Bayesian neural networks
It is instructive to decompose the uncertainty estimated by BNNs into two types: aleatoric (statistical, irreducible) and epistemic (systematic, reducible).
Originating from the intrinsic randomness in the datagenerating process, the aleatoric uncertainty persists in the limit of infinite training data.This type of uncertainty is explicitly modeled as the width of the distribution over the target quantities.Concretely, let a sightline with photometric observations d be given.Suppose it contains L galaxies, indexed l = 1, . . ., L. Our targets are κ and the set of redshifts and stellar masses of the individual galaxies, {z, M } ≡ {z l , M ,l } L l=1 .The aleatoric portion of our posterior is thus p (κ, {z, M }|d, W ), for a set of network weights, W .
In this paper, we adopt an independently Gaussian predictive distribution for κ and {z, M }, i.e.
where N (•|µ, σ) denotes the PDF of a Gaussian with mean and standard deviation µ, σ ∈ R. The BNN predicted the m, log s and {t l , log u l , v l , log w l } L l=1 for each sightline, where l indexes the galaxies in a sightline.We chose the Gaussian as a first-order approximation to the true posterior to focus on simple κ recovery tests.To control for effects from the prior, our training set has been subsampled from the original simulation to follow a Gaussian distribution in κ (Section 3.1).We train the network to jointly infer the per-galaxy properties {z, M } along with κ in order to improve the inductive bias of the BGNN (Battaglia et al. 2018).The idea is that explicitly training the network on {z, M } labels will guide its understanding of the per-galaxy contribution to the overall κ for the sightline.
Epistemic uncertainty stems from our limited knowledge of the perfect model.It is reducible in the sense that it can be explained away with sufficient training data.This type of uncertainty is often described as a distribution over the network weights W .Each realization of the weights corresponds to an alternative model, so integrating over this learned weight posterior amounts to Bayesian model averaging.Folding in the epistemic uncertainty, we have the full predictive distribution for κ of a given sightline: where we have made explicit the dependence on the training set by conditioning on the hyperparameters governing the training prior, Ω train .Not modeling the epistemic uncertainty at all reduces to simple density estimation, where the weight posterior p(W |Ω train ) is a delta function.In standard neural networks, which only give point estimates for the target parameters, both p(κ|d, W ) and p(W |Ω train ) are delta functions, so the predictive distribution in Equation 15 collapses to a delta function.
Exact evaluation of the integral in Equation 15requires averaging over all the weight configurations allowed by p(W |Ω train ), which is intractable.There exist several approximations (see, e.g.Charnock et al. 2020 for a review).Among them, we opt for Monte Carlo (MC) dropout, because it precludes the need to train an ensemble of BNNs (Gal & Ghahramani 2016;Kendall & Gal 2017).MC dropout is a Bayesian interpretation of regular dropout, whereby nodes are randomly dropped out, or set to zero, with some tuned probability.Mathematically, it replaces the true weight posterior p(W |Ω train ) with a variational distribution q θ ( Ŵ |Ω train ) parameterized by θ (Gal & Ghahramani 2016): where i indexes layers of the network and j the nodes in a given layer.Here, J i denotes the number of nodes in layer i, such that the weight matrix for layer i is W i ∈ R Ji×Ji−1 .When z i,j = 0, the input node j in layer i is dropped out.In practice, the per-layer dropout probability p i is replaced with a global dropout probability p drop .
The full predictive posterior with the variational approximation is thus

Optimization
The loss function is the negative log evidence lower bound (ELBO) over the N =200,000 examples in the training set, {d (n) , κ (n) , {z, M } (n) } N n=1 : where p(W ) is a prior on the network weights.To evaluate the first term in an unbiased way, we take a single MC sample Ŵ ∼ q θ ( Ŵ |Ω train ).Then W can be updated via gradient descent with respect to the realized sample.The second KL term is the "regularization" term that prevents the weights from deviating too far from our prior.This is intractable in its exact form, but re- when we assume a weight prior that can be factorized into a product of Gaussian priors in each layer.The length scale h is a hyperparameter that determines the width of the prior.Note that the dropout probability is also a hyperparameter in the formulation introduced here.It is not optimized along with W during training and was tuned manually as part of the hyperparameter search.We assume the same dropout probability p i = p drop for every layer i.For a given choice of p drop , h can be folded into the hyperparameter λ = h 2 (1 − p)/(2N ) controlling the L 2 regularization strength.We tune p drop , the initial learning rate, batch size, λ, and the number of BGNN layers (network depth) on a sparse random grid as part of our hyperparameter search.We train our BNN via minibatch gradient descent using the ADAM optimizer with a weight decay of λ =1e-4 and batch size of 1,000.The learning rate, initially 1e-3, is halved whenever the validation loss fails to decrease for 5 consecutive epochs.We stop training when the validation loss does not increase for 10 consecutive epochs.
It is common practice to transform the training input and labels so that they fall into a predefined range.This preprocessing step has the effect of facilitating optimization, as it promotes the numerical stability of the network's hidden units and their gradients.The input d and target κ labels are normalized so that they have a mean of 0 and standard deviation of 1 across the entire training set.

3.2.3.
V }, and passes them through five residual blocks.In each residual block, indexed b, the local (node-level) encodings {x } and global (graph-level) encoding u (b−1) are processed together through a series of multilayer perceptrons (MLPs) with residual skip connections to yield updated local and global encodings.The final encodings enter the last MLP, which produces the posterior PDFs over the local target quantities (z, M for each node) and the global target quantity κ.

Summary statistics matching
Matching summary statistics can be considered an ABC-based method, whereby summary statistics of a test sightline are matched against those of training sightlines with known κ, which define the prior (Birrer et al. 2019).We can compare the BGNN constraints with constraints produced by matching two types of summary statistics.One is the unweighted number counts (N sim ), defined simply as the number of galaxies observed around the sightline.The other is the inverse-distance-weighted number counts (N 1/r,sim ).For N sim observed galaxies in a field of view, where r i is the projected flat-sky distance of galaxy i from the central line of sight in arcseconds.The small value of 10 −5 was added to the denominator for numerical stability.The subscript "sim" in N sim and N 1/r,sim indicate that these are simplified statistics compared to the ones employed by TDCOSMO (e.g.Greene et al. 2013;Rusu et al. 2017).We did not consider more complex, modeldependent weighting schemes that incorporate inferred stellar masses and spectroscopic redshifts, e.g.z, L, M in Greene et al. (2013), in order to focus on those that could be computed directly from photometric observations.We also did not normalize the summary statistics with respect to the average in the simulations, because all the sightlines in this study were derived from the same simulations.The summary statistics matching was implemented as follows.The values of N sim (N 1/r,sim ) were computed for all the sightlines in the training and test sets.Then, for each test sightline, we queried the training sightlines with N sim (N 1/r,sim ) that matched the test sightline's own N sim (N 1/r,sim ) within some closeness threshold, which we denote ∆N sim (∆N 1/r,sim ).The κ corresponding to the matched training sightlines could then be interpreted as a posterior on κ, i.e. p(κ|d, Ω train ).The threshold ∆N sim (∆N 1/r,sim ) determines how much information is captured in the matched samples.As a rule of thumb, it should be kept as small as possible provided that it yields sufficient matched samples.A large threshold yields more samples, but the quality of the matching may be poor.A small threshold sacrifices the number of samples for a closer match to the test sightline's summary statistic and thus potentially its κ.At one extreme, ∆N sim = 0 means that a training sample must have the same exact N sim as the test sample in order to be considered "matched" and contribute to the posterior.For a given test sightline, ∆N sim was chosen to be the smallest among {0, 1, 2, 4} that yielded more than 100 matched samples.Likewise, N 1/r,sim was chosen to be the smallest among {1, 2, 4, 8} that yielded more than 100 matched samples.This resulted in ∆N sim = 1 for most sightlines and an average ∆N 1/r,sim of 1.3.
We stress that our matching scheme differs from that implemented by Greene et al. (2013) and Rusu et al. (2017) in two main ways.First, these studies binned the range of each summary statistic and reweighted the κ contribution of each matched sightline in the final posterior by the inverse of the bin count, so that each bin contributed equal weight.The reweighting was intended to prevent the κ posterior from being dominated by a large number of sightlines in a given N sim bin.We do not enforce such a reweighting with respect to the summary statistic, but reweight by the training distribution in κ when we hierarchically infer the hyperparameters (see Equation 23).Second, the N sim and N 1/r,sim were matched jointly, rather than independently.We chose not to apply the joint constraint, because our small training set size of 200,000 sightlines made it difficult to achieve sufficient sample statistics.Rusu et al. (2017) used a joint constraint from four types of summary statistics on 10 9 training sightlines and note that using more than four may result in sample sparsity.
The κ values of the matched training sightlines can be interpreted as posterior samples.We denote the matched samples and the posterior PDFs represented by these samples as and for consistency with the notation of the BGNN posterior in Equation 17.We condition on Ω train everywhere, because both the BGNN and summary statistics techniques depend on the distributions of sightlines in the training set.

Hierarchical Inference
We designed eight test sets, each containing 1,000 sightlines drawn from normal distributions with varying µ and σ (see Table 2).They included shifted test sets with particularly overdense (high-κ) sightlines, where strong lenses tend to occur (B3, C3, C4).In particular, C4 serves as an extreme stress test, as it has very little overlap with the training κ distribution.See Figure 10 for a visualization of the eight test distributions and their varying degrees of overlap with the training distribution.The size of 1,000 was chosen, because it was large enough to avoid small sample statistics but allowed experimental turnaround within reasonable time (see Section 4.3)..The κ posteriors in Equations 17, 21, and 22 are conditioned on Ω train , the set of assumptions governing the training set.Here, Ω train includes the particular choices of the underlying N-body simulation, semi-analytic models used to paint galaxies onto the halos, and the observation noise in the galaxy positions and magnitudes.The test set, in principle, may differ in any of these aspects.In this paper, we focus on training-test mismatches in the κ distributions, by designing test sets from the same simulation suite as the training set but subsampled to follow shifted distributions in κ.As described in Section 3.1, we assume Gaussian distributions for both the training and test distributions.
We hierarchically infer the hyperparameters governing the Gaussian test distribution, µ test , σ test , using importance sampling (Hogg et al. 2010;Foreman-Mackey et al. 2014;Wagner-Carena et al. 2020).The posterior on µ test , σ test takes the form: ).We choose a Gaussian distribution, with support in R, as the training prior, and ensure that it is sufficiently broad to make evaluation of the MCMC objective numerically stable.Because Equation 23is not guaranteed to be unbiased for finite M , we carried out convergence tests on the hierarchical results to determine our M =20,000.This is a conservative value, given that there are only two hyperparameters.We summarize the model (hyper)parameters and their (hyper)priors in Table 1.
We interpret the output of the BGNN and summary statistics matching as posterior PDFs, with the training prior applied.For evaluating κ recovery on individual sightlines, we work with likelihoods by dividing by the training prior: where p(κ|Ω train ) = N (0.01, 0.04) in our setup.
4. RESULTS AND DISCUSSION We organize our results as follows.In Section 4.1, we evaluate the BGNN's κ recovery performance on individual sightlines.Having established that the individual constraints are reasonable, we proceed to present the hierarchical inference results of the test population as a whole in Section 4.2.
As summarized in Table 2, our experiments vary in the mean (µ test ) and standard deviation (σ test ) of the test κ populations.Test set A is representative of the training set, i.e. follows the same distribution in κ.The four test sets in block C share an artificially narrow σ test of 0.005 but vary in the centers µ test .The three test sets in block B share a broader σ test of 0.02 and vary in the µ test .The resulting κ distributions are visualized in Figure 10.

Individual κ recovery
Let us first compare the accuracy of κ recovery across the BGNN and summary statistics methods for a range of κ spanning the training distribution.In Figure 4, we bin the representative test set A into 17 κ bins of fixed bin width 0.01 and plot the mean and standard deviation of the recovered κ in each bin, weighted by the estimated uncertainties for the individual sightlines.That is, given the central estimate (κ i ) and the associated 1σ uncertainty (σ i ) of the likelihoods for each sightline i, we computed the weighted mean and standard error (SE) using 1/σ 2 i as the weights.Explicitly, we computed for each bin: and where B was the total number of sightlines in the bin, W ≡ B i=1 1/σ 2 i was the sum of the weights, and the If all weights 1/σ 2 i are equal, the factor V reduces to 1/B, so Equation 26 becomes the conventional (unweighted) SE that scales with 1/ √ B. In Figure 4, the dots are computed from Equation 25 and the sizes of the error bars from Equation 26.Overall, our BGNN is more accurate than both N sim and N 1/r,sim in all κ bins.It boasts the highest accuracy in regions of κ best represented in the training distribution ∼ N (0.01, 0.04).On the other hand, all the methods reveal signs of upward bias in κ < −0.05 and downward bias in κ > 0.06 given the standard errors per bin.The bias is the worst for N sim , followed by N 1/r,sim and the BGNN.
Also interesting is the level of accuracy for each value of κ, rather than for each bin.To assess this, we apply a correction factor K ≡ 1/(1 − V ) (Bevington & Robinson 2003) to account for the varying bin count, i.e.
The error bars in Figure 5 are computed from Equation 27whereas the dots are the same as in Figure 4. We find that, on a per-κ (per-sightline) level, both the N sim and N 1/r,sim summary statistics reveal signs of upward bias in κ < −0.04 and downward bias in κ > 0.09.The BGNN predictions are consistent with the true κ for all κ values.We choose three metrics to quantitatively compare the κ inference performance of the BGNN and the summary statistics matching across experiments.The first metric is the log likelihood evaluated at the truth: i.e. the inferred likelihood from Equation 24 evaluated at a sightline's true κ value.Higher log p is more accurate and/or more precise.We also use the Median Absolute Error (MAE) and Median Absolute Deviation (MAD) as robust measures of accuracy and precision, respectively, with smaller values being better.These are given by: where κ sample ∼ p(d|κ, Ω train ) for a given sightline with true convergence κ true .The factor of 1/1.4826 converts the MAD of a Gaussian into the standard deviation, so we can interpret Equation 30 as a robust measure of standard deviation.Note that for the BGNN experiments, k samples constitute samples from the BGNN likelihood for this sightline, while for the summary statistics methods, k samples are the matched training set samples reweighted according to Equation 24.Table 3 compares these metrics across BGNN, N sim , and N 1/r,sim for a grid of κ bins.Values listed are the median and standard deviation taken across all the sightlines in each bin.The BGNN is generally more accurate and more precise compared to either summary statistic, in terms of log p, MAE, and MAD.Note, however, that these metrics weight each sightline equally.In order to account for the varying information content across the  25and the sizes of the error bars from Equation 26.The BGNN is more accurate than both N sim and N 1/r,sim in all κ bins.It boasts the highest accuracy in regions of κ best represented in the training distribution ∼ N (0.01, 0.04).On the other hand, all the methods reveal signs of upward bias in κ < −0.05 and downward bias in κ > 0.06.The bias is the worst for N sim , followed by N 1/r,sim and the BGNN.The bin count was 148 for all bins.Note that N sim and N 1/r,sim are simplified versions of the summary statistics employed by TDCOSMO (e.g.Rusu et al. 2017), and our analysis operates on much fewer training sightlines (2 × 10 5 compared to TDCOSMO's 10 9 ).27.The BGNN predictions are consistent with the true κ for all κ values.On the other hand, both the N sim and N 1/r,sim summary statistics reveal signs of upward bias in κ < −0.04 and downward bias in κ > 0.09.The bin count was 148 for all bins.Note that N sim and N 1/r,sim are simplified versions of the summary statistics employed by TDCOSMO (e.g.Rusu et al. 2017), and our analysis operates on much fewer training sightlines (2 × 10 5 compared to TDCOSMO's 10 9 ).sightlines, we examine the hierarchical inference results of population κ statistics in the next section.

Population inference
We proceed to present the hierarchical inference results on the population κ statistics.Recall that our experiments vary in the mean µ test and standard deviation σ test of the test populations.Here we compare the BGNN's recovery performance on our test sets to understand the sensitivity of the model across the hyperparameter space.The simple and analytic H 0 error decomposition by Birrer et al. (2021) described in Section 2 also allows us to connect the precision and accuracy of µ test recovery to the H 0 error budget.
Table 4 summarizes the bias and uncertainty of µ test constraints for all the experiments.Treating the sightlines in each experiment as a population of strong lensing sightlines, we can interpret our maximum-likelihood estimate of µ test as the central estimate of the external convergence, κtest , and the uncertainty on µ test as σ(κ ext ).Equation 11 then allows us to compute the contribution of the external environment to the overall fractional uncertainty on H 0 .
For all experiments, the BGNN introduces lower bias on H 0 (last column) than does either summary statistic.With the exception of the extremely overdense population C4, which introduces a 1.6% bias on H 0 , the bias level remains at sub-percent for the BGNN.The performances of all three methods follow patterns expected from the individual recovery results in Section 4.1-the best constraints on µ test come from environments best represented by the training set, i.e.B2, C2 that have κ ∼ 0. Notably, the percent bias on H 0 is only -0.1% on B2 and -0.2% on C2 for the BGNN.
The bias increases towards underdense and overdense populations in the tails of the training distribution, but the BGNN is significantly more robust than the summary statistics.Whereas the bias level only increased to ∼0.8% from C2 to C1/C3 for the BGNN, it increased to ∼2% for N sim and ∼1.5% for N 1/r,sim .A similar pattern holds for the broad B group of test sets as well, although the degree of deterioration is slightly worse for the B group because it contains more of the extreme sightlines.
We find that the 1σ uncertainty on H 0 (second-to-last column) is smaller than the bias on H 0 across the board, pointing to a global underestimation of the uncertainty on µ test .The only exception is the BGNN for B2, where the bias is small enough (-0.1%) that the estimated 1σ uncertainty of 0.14% can account for it.Even so, relative to the summary statistics, the BGNN estimates of the uncertainty are better, i.e. comes closer to covering the bias.
The hierarchical inference constraints for C3 is shown in Figure 6.This test set represents a particularly challenging environment with µ test shifted high and σ test artificially narrow.Our BGNN can accurately recover the high µ test and narrow σ test within its 1σ credible interval.The BGNN constraints translate into a 0.3% contribution to the H 0 uncertainty without evidence of bias.On the other hand, both the N sim and N 1/r,sim summary statistics lead to downward biases on µ test .They also overestimate σ test , which happens with underestimated uncertainties on individual κ.
Similarly, the constraints for the broad, overdense test set (Experiment B3) is shown in Figure 7. Out of all our test distributions, B3 best characterizes the κ ext distributions in the seven TDCOSMO lenses.Our BGNN can accurately recover the high µ test within its 2σ credible interval.Again, both the summary statistics underestimate µ test and overestimate σ test .
Our hierarchical inference pipeline reaches a numerical breaking point, however, with a test set shifted far away from the training set, e. BGNN can accurately recover the µtest, σtest within its 1σ credible interval.Both the N sim and N 1/r,sim summary statistics lead to downward biases on µtest.They also overestimate σtest.Note that N sim and N 1/r,sim are simplified versions of the summary statistics employed by TDCOSMO (e.g.Rusu et al. 2017), and our analysis operates on much fewer training sightlines (2 × 10 5 compared to TDCOSMO's 10 9 ).ing density in this region of high κ affects the summary statistics methods more directly, because they run out of training samples to match.See Figure 10 for a visualization of the κ distribution in Experiment C4 compared to that of the training set.
Even at this breaking point, however, the true µ test falls within the BGNN's 2σ credible interval.For the summary statistics, the downward biases on µ test become more dramatic than in Experiments C3 or B3.The uncertainties are underestimated for the BGNN in this regime, resulting in overestimation of σ test .The summary statistics methods do recover σ test within their 2σ credible interval, however.
Based on a simulated dataset consisting of 200,000 sightlines following N (0.01, 0.04), we have demonstrated the ability of BGNNs to effectively model the mapping between photometric observations and the continuous κ space.Summary statistics matching, on the other hand, operates on discrete samples so cannot extrapolate beyond the κ samples present in the dataset.When there are only 200,000 sightlines, the poor matching causes posteriors to revert to the prior, and when the prior is shifted from the actual population, bias results.The BGNN also tends toward the training distribution when trained on a small dataset and also leads to bias, but it is more robust to the train-test mismatch.For both approaches, we expect the hierarchical inference performance to improve with bigger training sets that better sample high-and low-κ environments.Further tests are needed to probe the relative performance of the BGNN and summary statistics methods with increasing training sets.We postpone this exercise to future work.We emphasize that the comparison between BGNN and summary statistics presented in this work is not to be viewed as an evaluation of the TDCOSMO analysis of κ ext .We include the comparison to highlight the difference between a neural net that operates on the continuous κ space and an ABC method that either accepts or rejects discrete samples.Our implementation of the summary statistics matching differs in significant ways from the TDCOSMO implementation.TDCOSMO uses orders of magnitude more sightlines (∼billion compared to our 200,000) (e.g.Greene et al. 2013;Rusu et al. 2017).Their sightlines follow the phenomenological distribution from the Millennium Simulation which extends out to high κ.Thanks to the dataset size, they are able to match to both N sim and N 1/r,sim , whereas we have matched to either N sim or N 1/r,sim in our work.Additionally, TDCOSMO incorporates external shear information from the lens modeling to further constrain their κ ext .This could not be done here, because we focused on sightlines without strong lensing.
A key takeaway from our toy hierarchical tests is the importance of systematically testing our assumptions about the κ prior.Regardless of the κ inference algorithm used, neural network or ABC-based, we must infer the target κ population and account for any difference from the prior.If we instead derive κ posteriors from individual sightlines based on a chosen κ prior and let them independently enter the joint downstream H 0 analysis, we effectively multiply by the prior N times, where N is the number of sightlines.As N increases, the particular choice of κ prior will become more significant.

Computation time
The total computation time of our hierarchical κ inference pipeline can be broken down into the BGNN training/inference time and the hiearchical inference time.Training the BGNN with the configuration detailed in Section 3.2 took about 12 hours.The model was trained for 118 epochs on an Intel Xeon Gold 6148 GPU on the Cori system available at the National Energy Research Scientific Computing Center (NERSC).Generating BGNN predictions on the 1,000 sightlines in each test set took seconds.For a given training set size, the BGNN training and inference time can be considered fixed with respect to the number of test sightlines.The last step of hierarchical inference involves MCMC sampling, which took 2-3 hours to converge on 4 CPU cores for 1,000 sightlines.In our current implementation, the evaluation of the MCMC objective scales linearly with the number of sightlines in the test set.The summary statistics matching took 30 minutes for 1,000 sightlines on 4 CPU cores, for either N sim or N 1/r,sim .In our current implementation, the matching time scaled linearly with the number of sightlines and with the size of the "closeness threshold grid."For each sightline, we carried out matching according to each threshold on this prespecified grid and took the smallest threshold with more than 100 matched samples at the end.We used a grid of size 20 for both N sim and N 1/r,sim .In the hierarchical inference step, the MCMC sampling with the summary statistics likelihoods also took 2-3 hours to converge on 4 CPU cores for 1,000 sightlines -similarly as with the BGNN likelihoods.
To generate our datasets, we performed additional raytracing on top of the CosmoDC2 values, as detailed in Section 6.4.1, to compute the κ labels and queried the galaxy catalog to construct the input.We simulated 850,000 sightlines and subsampled smaller Gaussian subsets to conduct our analysis.Out of the 850,000 sightlines, we reserved 500,000 sightlines for subsampling the Gaussian training set of 200,000 sightlines and used the remaining 250,000 sightlines to subsample the Gaussian validation and test sets.The raytracing took 85 hours for 850,000 sightlines on 18 cores.The input construction took 102 hours for the same sightlines.

CONCLUSION
In this paper, we introduced a novel graph-based neural network architecture that can infer κ from any number of observable galaxy properties and a first hierarchical pipeline for constraining the population κ statistics.To probe a range of numerics in the recovery of hyperparameters in the target population, we have designed toy testbeds where the training κ distribution (our prior) and test populations are mismatched to varying degrees.We present comparisons between BGNN and summary statistics matching throughout to showcase the distinction between a neural network that operates in a con- Accurate µtest recovery is critical to unbiased H 0 inference and the impact of σtest recovery is second-order (Section 2).The BGNN can accurately recover the µtest, σtest within its 2σ credible interval.Both the N sim and N 1/r,sim summary statistics lead to downward biases on µtest.Note that N sim and N 1/r,sim are simplified versions of the summary statistics employed by TDCOSMO (e.g.Rusu et al. 2017), and our analysis operates on much fewer training sightlines (2 × 10 5 compared to TDCOSMO's 10 9 ).tinuous κ space and an ABC-based method that either accepts or reject discrete samples.Note, however, that our implementation of summary statistics matching does not represent TDCOSMO's, which is more advanced and factors in more information.
We conclude that: • Our BGNN can yield accurate and precise κ posterior PDFs by processing all available photometric observations, e.g. the positions and magnitudes of individual galaxies around a line of sight.On average, it is 60% more accurate than matching number counts N sim and 30% more accurate than matching distance-weighted number counts N 1/r,sim .
• When propagated into hierarchical inference, the BGNN-inferred posterior PDFs lend themselves to precise and accurate recovery of hyperparameters for a range of test populations shifted from the training distribution.On a population fully encompassed by the training distribution κ ∼ N (0, 0.02), the BGNN can recover the population mean on κ precisely and accurately, translating into only a 0.1% 1σ uncertainty contribution to H 0 without evidence of bias.The N sim and N 1/r,sim summary statistics, on the other hand, underestimate κ, which would lead to a slight upward bias on H 0 even on this representative test set.
• Both the BGNN and the simplified summary statistics methods became more biased as the test pop-

BGNN Truth
Fig. 8.-Constraints on the population hyperparameters for the narrow, extremely overdense test set (Experiment C4) with truth µtest = 0.08 and σtest = 0.005.Contours are 68% and 95% credible intervals.Accurate µtest recovery is critical to unbiased H 0 inference and the impact of σtest recovery is second-order (Section 2).The BGNN can still recover the µtest within its 2σ credible interval, but the σtest is overestimated by a factor of 15.Both the N sim and N 1/r,sim summary statistics lead to severe downward biases on µtest but do recover σtest within their 2σ credible interval.s.Note that N sim and N 1/r,sim are simplified versions of the summary statistics employed by TDCOSMO (e.g.Rusu et al. 2017), and our analysis operates on much fewer training sightlines (2 × 10 5 compared to TDCOSMO's 10 9 ).
ulations sample the tails of the training distribution.The BGNN is found to be more robust to this training-test mismatch, however; on a particularly challenging shifted and narrow population with κ ∼ N (0.04, 0.005), the bias on H 0 was 0.8% for the BGNN as compared to 2% and 1.5% for the N sim and N 1/r,sim summary statistics, respectively.
• Based on a simulated dataset consisting of 200,000 sightlines following N (0.01, 0.04), we have demonstrated the ability of BGNNs to effectively model the mapping between photometric observations and the continuous κ space.On the other hand, the simplified summary statistics matching as implemented in our paper cannot extrapolate beyond the discrete κ samples in the dataset, resulting in poor generalizability and accuracy.For both approaches, we expect the hierarchical inference performance to improve with bigger training sets that better sample high-and low-κ environments.Further tests are needed to probe the relative performance of the BGNN and summary statistics methods with bigger training sets.
We have validated the BGNN method of κ inference within the hierarchical Bayesian framework using simplified, well-controlled experiments.Being the first to infer the population κ statistics, we took a pedagogical angle and focused on "toy testbeds" for our method.We as-sumed convenient Gaussian parameterizations throughout in order to conduct simple recovery tests, but it remains to probe the numerics involved with inferring non-Gaussian population κ distributions, particularly those with a right skew.In addition, our test sets were drawn from the same simulation as the training set.In order to assess the impact of training-test mismatches in the assumed galaxy-halo connection, it will be important to test on sightlines derived from other simulations, such as the Millennium Simulation (Springel et al. 2005;Hilbert et al. 2009).Ultimately, we can better control the associated bias by inferring extra hyperparameters governing various aspects of the galaxy-halo connection.These extensions and further validation tests are not limited to the BGNN method and apply for the summary statistics matching (ABC methods) as well.
Given the discretized deflection angles α (k) (θ (k) ) , we can raytrace the observed angular position θ (0) back through the series of lens planes: for k = 1, • • • , K. In particular, for the source located at shell k = K, we have Naively applying the iterative scheme in Equation 37 is computationally prohibitive for large K.For deep, full-sky simulations, there exist iterative techniques that reduce the number of arithmetic operations and memory (Seitz & Schneider 1994;Jain et al. 2000;Hilbert et al. 2009;Hartlap 2005;McCully et al. 2014).
6.3.Lensing quantities Our target lensing quantities are the convergence and shear, which describe the strength and direction, respectively, of the lensing effect along the entire line of sight.They can be identified in the distortion matrix Γ, obtained by differentiating Equation 33 with respect to θ (Schneider et al. 1992): The lensing distortion can thus be decomposed into an isotropic change in area, parameterized by the convergence κ = κ(θ), and an area-preserving change in shape, parameterized by the shear components γ 1,2 = γ 1,2 (θ) forming the complex shear γ = γ 1 + iγ 2 .In particular, ψ satisfies the Poisson equation, i.e.
Predicting the lensing signal of halos entails relating their mass density to κ.In the Born approximation limit, κ is the weighted sum of the projected surface mass densities of individual halos.Denoting the three-dimensional mass density of the individual halos, indexed by n, as ρ n (η, θ), we have Note that the Born approximation is not strictly valid in the strong lensing regime, as the light paths undergo significant perturbations (Birrer et al. 2017;Bar-Kana 1996).

Raytracing numerics
Note that κ and γ 1,2 contain second-order gradients of the lensing potential ψ.In full-sky simulations, we can obtain these vector fields on the sphere by generating maps of the gradients with respect to the spherical harmonics.The computation can be performed rapidly using FFTs on sky pixelization schemes such as HEALPix (Hockney & Eastwood 1988;Lewis 2005).The deflection field α(θ) and distortion matrix Γ(θ) are first evaluated on a mesh grid, after which multiple rays can be traced backward in parallel.The choice of mesh spacing, which we will call δ, determines the spatial resolution of the projected matter density.
In principle, all halos in the foreground of the source contribute to lensing, but the effect becomes smaller with increasing projected distance from the source.A cutoff in the multipole (l max ) smoothes out the effects within these angular scales.The cutoff determines the "field of view" of the light cone, or the sky area around an angular position θ that contributes lensing potential to the calculation of Γ(θ ).Let us denote the average field of view as D. CosmoDC2 uses l max = 8000, which corresponds to D ∼ 1.5 .
For smaller light cones with D spanning a few arcminutes, flat-sky approximations may suffice.The field of view is divided into a rectilinear grid, from which the gradients are computed numerically using finite differences.Though not exact, an analog of δ in this approximation will be the grid spacing.Similarly, the size of the flat field is akin to D operationally defined above.6.4.1.Structure-enhanced raytracing Defining the gravity-only component of our mock Universe is a large-scale cosmological N-body simulation called the Outer Rim (Heitmann et al. 2019).It sampled a volume of (4.225 Gpc) 3 with 10,240 3 particles, yielding a mass resolution of m p = 2.6 × 10 9 M .The CosmoDC2 galaxy catalog was built on top of the Outer Rim particles using semi-analytic galaxy models (Korytov et al. 2019b).
In addition to galaxy and halo information, CosmoDC2 characterizes each galaxy with weak lensing shear and convergence.The lensing pipeline of CosmoDC2 computed these quantities using a multi-plane raytracing algorithm outlined in Section 6.In particular, the surface densities were evaluated on a HEALPix (Gorski et al. 2005) grid of NSIDE=4096, which corresponds to a resolution of δ ∼ 51 (Larsen et al. 2019).
For galaxy-scale strong lensing studies, we instead require lensing statistics that vary on smaller, 1 scales.At the same time, we must include the impact of large-scale structure at large angular scales, which are already captured in the CosmoDC2 κ, γ 1,2 values.To obtain the structure-enhanced κ ext , we raytraced through the Outer Rim halos along the sightline of each source galaxy at a resolution of δ = 1 , while subtracting out the effect of any extra mass we were adding in the process.This subtraction scheme ensures that the mean curvature of the universe equals the one imposed by the background, which is true if and only if the mean convergence of all angular directions to all redshifts is zero (Birrer et al. 2017).That is, our structure-enhanced convergence values must satisfy dη dθκ(θ; η) = 0, (41) where κ(θ; η) denotes the convergence evaluated at the angular position θ and comoving distance η.We assume that the CosmoDC2 convergence values already satisfy this condition.In the following, we describe the process of structure-enhanced raytracing in detail.
Let us index each sightline by s.For a given source galaxy at angular position θ s and redshift z s defining the sightline, we first queried halos with redshifts z < z s and masses M 200 > 10 11 M located inside an aperture of fixed radius 1.5 .These halos constituted the light cone.The aperture size was chosen to match the CosmoDC2 field of view, D ∼ 1.5 .The mass cut is a numerical choice, built in to reduce the computation time raytracing over halos with insignificant masses.We determined the mass threshold of 10 11 M via a convergence test; it was increased from the base value of 10 10 M until the κ ext value calculated within the aperture remained constant.On average, there were about 500 halos per light cone.
We then performed full multi-plane raytracing from the source plane z s through each of the deflector halo planes using the strong lensing simulation package Lenstronomy (Birrer & Amara 2018).Surface densities were evaluated at a resolution of 1 , representing a structure enhancement by a factor of ∼ 51 compared to CosmoDC2.We then evaluated the convergence at the center of our aperture, which we denote κ render,s .
Adding κ render,s to the original CosmoDC2 convergence κ DC2,s would capture both small-and large-scale fluctuations, as desired, but erroneously double count the Outer Rim particles already included in the κ DC2,s computation.To preserve the mean mass in the universe, we employ the following calibration scheme.We render the halos at random positions within the aperture, compute the resulting convergence at the aperture center for each realization i which we denote κ (i) s , and compute the mean across N realizations: Moving the halos around corresponds to a Monte Carlo integration over all angular positions θ, i.e. dθ ≈ 1 N N i .Our final, calibrated convergence value for a sightline s results from subtracting off this quantity: κ s = κ DC2,s + κ render,s − < κ (i)  s > i .
Assuming that we sample our sightlines uniformly across the sky, we have s ∝ dθ and since we assumed that the CosmoDC2 values satisfied Equation 41, we can claim To generate the halo catalogs, Korytov et al. (2019a) (henceforward "CosmoDC2 team") ran a parallel, tree-based friends-of-friends (FOF) halo finder on the Outer Rim simulation with dimensionless linking length b = 0.168 and minimum requirement of 20 particles per halo.All particles, not just those in the original FOF halo, were counted in radial shells centered on the point of minimum potential.They then constructed halo merger trees using a particlemembership algorithm (Rangel et al. 2017).Their halo light cone was the product of tiling the simulation box in space to build a greater volume and applying a parallel solver that linearly interpolated halo positions between adjacent snapshot positions.The light cone filled one octant (∼ 5,000 deg 2 ) of the sky and had a depth of z = 3.Although the original CosmoDC2 lensing pipeline evaluated surface densities directly from the Outer Rim particles, we take as input the auxiliary halo catalog generated from running a halo finder on the particles.We treat each halo in the

BGNN 2 .
Fig. 1.-Left:Our BGNN takes as input the photometric measurements, d (i) , of galaxies comprising a test sightline i and outputs the posterior on κ, p(κ (i) |d (i) ).After repeating this for all the test sightlines, we combine the posteriors into a hierarchical Bayesian model to constrain the µtest, σtest governing the test population.Right: simplified dependence relations shown as PGMs.Dots denote fixed values, shaded ovals observations, and unshaded ovals random variables.During training set generation, the set of assumptions governing the training set, Ω train , are used to realize specific values of {z, M } and κ for individual sightlines.The photometric observations of the field of view for that sightline, d, depend on κ, among other quantities not shown.During inference, the trained BGNN infers κ from d.The resulting posteriors for all test sightlines are used to recover µtest, σtest of the test population.
Fig. 2.-Illustration of κ in the training sightlines.Each dot is a galaxy meeting the photometric cut i < 25.3.Colors of dots indicate redshift z and sizes of dots increase linearly with decreasing i-band magnitude.From left to right, the true κ increases at approximately the centers of test sets C1 through C4 (-0.04, 0, 0.04, 0.08).The unweighted number counts N sim and distance-weighted number counts N 1/r,sim (Equation20) also generally increase with increasing κ.The source was located at z = 2 for all the sightlines.Best viewed in color.
Network architecture A diagram of our network architecture is shown in Figure 3.Our network takes as input the features of V nodes comprising a line of sight, denoted {x Fig. 3.-Our network takes as input the features of V nodes comprising a line of sight, denoted {x (0) 0 , • • • , x (0) V }, and passes through five residual blocks.In each residual block b, the local (node) encodings {x (b−1) 0 , • • • , x (b−1) V } and global (graph) encoding u (b−1) are processed together through a series of residual MLPs to yield updated local and global encodings.The final encodings enter the projector MLP, which produces the posterior PDFs over the local target quantities (z, M for each node) and the global target quantity κ.
Fig. 5.-The binned κ recovery based on the likelihoods modeled by BGNN and the two summary statistics.The dots are computed from Equation 25 and the sizes of the error bars from Equation27.The BGNN predictions are consistent with the true κ for all κ values.On the other hand, both the N sim and N 1/r,sim summary statistics reveal signs of upward bias in κ < −0.04 and downward bias in κ > 0.09.The bin count was 148 for all bins.Note that N sim and N 1/r,sim are simplified versions of the summary statistics employed by TDCOSMO (e.g.Rusu et al. 2017), and our analysis operates on much fewer training sightlines (2 × 10 5 compared to TDCOSMO's 10 9 ).
Fig.6.-Constraints on the population hyperparameters for the narrow, overdense test set (Experiment C3) with truth µtest = 0.04 and σtest = 0.005.Contours are 68% and 95% credible intervals.BGNN can accurately recover the µtest, σtest within its 1σ credible interval.Both the N sim and N 1/r,sim summary statistics lead to downward biases on µtest.They also overestimate σtest.Note that N sim and N 1/r,sim are simplified versions of the summary statistics employed by TDCOSMO (e.g.Rusu et al. 2017), and our analysis operates on much fewer training sightlines (2 × 10 5 compared to TDCOSMO's 10 9 ).
Fig. 7.-Constraints on the population hyperparameters for the broad, overdense test set (Experiment B3) with truth µtest = 0.04 and σtest = 0.02.Contours are 68% and 95% credible intervals.Accurate µtest recovery is critical to unbiased H 0 inference and the impact of σtest recovery is second-order (Section 2).The BGNN can accurately recover the µtest, σtest within its 2σ credible interval.Both the N sim and N 1/r,sim summary statistics lead to downward biases on µtest.Note that N sim and N 1/r,sim are simplified versions of the summary statistics employed by TDCOSMO (e.g.Rusu et al. 2017), and our analysis operates on much fewer training sightlines (2 × 10 5 compared to TDCOSMO's 10 9 ).
Fig. 10.-Visualization of the test set distributions.A (dashed) refers to the training set distribution.The B group of experiments (blue, solid) probe varying µtest with broad σtest, and likewise with the C group (black, solid) with narrow σtest.Note that C4, in particular, has very little overlap with the training set.
where p(µ test , log σ test ) is our hyperprior and i indexes the test set.Note that we work in the natural log space of σ test to restrict our sampling to positive values.train ) assigns more relative importance to κ underrepresented by the training set.Note that division by zero will result whenever p(κ|µ test , σ test ) assigns density to a κ value not encompassed by p(κ|Ω train test , σ test , we first draw M samples from the individual posteriors conditioned on the training set.We can do this efficiently with a BGNN or use the matched samples themselves with the summary statistics method.We then evaluate the density of the training prior p(κ|Ω train ) and our target Gaussian distribution at those κ samples and take the mean of the density ratio across samples.The weighting by 1/p(κ|Ω The binned κ recovery based on the likelihoods modeled by BGNN and the two summary statistics.The dots are computed from Equation Summary of test sets defined by varying the input µtest, σtest All of the test sets had N =1,000 sightlines.

TABLE 3 Metrics
Evaluating κ Recovery on Individual Test Sightlines for Each κ Bin a Values listed are the median of the metrics across sightlines in a given bin.Errors listed are the standard deviations in a given bin.bDefined in Equation 28.Higher is better.cDefined in Equation29.Lower is more accurate.dDefinedin Equation30.Lower is more precise.

TABLE 4
Recovery of population κ and impact on H 0 error budget Primary contribution to the fractional H 0 error budget based on µtest constraints, defined in Equation11.bBias introduced in H 0 due to bias on µtest, defined in Equation13. a