Fast Globally Optimal Catalog Matching using MIQCP

Jacob Feitelberg; Amitabh Basu; Tamás Budavári

doi:10.3847/1538-3881/acf5e2

1. Introduction

A cornerstone of modern astronomy is ambitious systematic surveys by dedicated telescopes with well-understood selection functions. Their imaging data collected over many years and a wide range of wavelengths are summarized in catalogs for most astronomy research projects. To enable time-domain and multiwavelength studies, several tools have been developed to combine or crossmatch the catalogs, such as TOPCAT (Taylor 2005), the CDS XMatch (Pineau et al. 2015) portal, and ASPECTS (Fioc& Michel 2014). Most of these, however, rely on some sort of heuristics and do not take into account the inherent statistical nature and computational challenge of the global catalog matching problem.

Budavári & Szalay (2008) approached the problem through Bayesian hypothesis testing. Their methodology was implemented into SkyQuery (Dobos et al. 2012), a service part of the SciServer Science Platform (Taghizadeh-Popp et al. 2020). Budavári & Basu (2016) solved the problem globally for the simpler two-catalog case by reformulating the problem as a linear assignment problem. Their method maximized the marginal likelihood of a crossmatching using the Hungarian algorithm (Kuhn 1955). Shi et al. (2019) extended the method to handle more than two catalogs by enumerating all subsets of the data set (all possible associations to hypothesized astronomical objects) and solving the problem using mixed integer linear programming (MILP). Nguyen et al. (2022) modified the MILP formulation to improve its speed for a larger number of catalogs. They applied their method over small sections of (simulated) catalogs called islands created by a DBSCAN (Ester et al. 1996) clustering procedure, which guaranteed that every such cluster or island was far away from each other. However, limited by the scaling of their algorithmic solution, Nguyen et al. (2022) could not process more than 20 catalogs in an island in under 45 minutes on our standard desktop computing hardware described in Section 5.

In this paper, we introduce a significantly faster, yet exact, approach to crossmatching that improves globally optimal algorithmic solutions by reformulating the problem using quadratic constraints. Section 2.1 introduces the Bayesian hypothesis testing framework we use. Section 2.2 explains the methods that we build upon. Section 3 explains our new approach using quadratic constraints. Section 4 puts forward numerical accuracy improvements and an algorithm to provide a bound of the maximum number of objects in a data set. Section 5 demonstrates empirically that our new approach is superior to prior methods using simulated catalogs. Section 6 demonstrates our new method on real-world data from the Hubble Source Catalog. Section 7 provides concluding remarks and discusses possible future improvements.

2. Globally Optimal Matching

The goal of astronomical catalog crossmatching is to best match sources from different catalogs to the same astronomical objects. We follow the Bayesian hypothesis testing framework introduced by Budavári & Szalay (2008), which considers the marginal likelihood of possible competing associations. At the heart of the approach is the member likelihood function that is the probability density of the measurement for each source i in catalog c given a hypothesized true direction ω and uncertainty. The data for source (i,c), represented hereafter by D_ic, is often summarized into an estimated direction x_ic, and the member likelihood function is often assumed to be a Gaussian in the tangent plane of the observation. Our goal is to optimally match all available ic detections as appropriate. Below we formulate the problem in general and discuss the specializations for the usual application limits.

2.1. Problem Formulation

To formalize the problem, we follow the notation in Nguyen et al. (2022). Let there be C catalogs with each catalog indexed c ∈ {1,...,C}. We subscript variables with ic as shorthand for (i,c) to denote a variable associated with source i in catalog c. Suppose there are N_c sources in each catalog with D_ic denoting the source's raw imaging data. Each source has a corresponding true direction ω it comes from and likelihood

$\begin{eqnarray}&&{{\ell }}_{{ic}}(\omega )=p({D}_{{ic}}| \omega ).\end{eqnarray} \tag{ 1 }$

This framework allows for other properties, such as color and brightness, to be considered, which is explored in Salvato et al. (2018) and Marquez et al. (2014). However, we choose to focus on directional data only.

Crossmatching is equivalent to partitioning of the sources where no sources in the same catalog are assigned to the same subset. Given a partition ${ \mathcal P }=\{{S}_{1},{S}_{2},\ldots ,{S}_{{N}_{\mathrm{obj}}}\}$ , each nonempty subset in the partition, indexed by o ∈ O = {1,..., N_obj}, represents an astronomical object. Each astronomical object also has additional properties such as an unknown true direction and spectrum. In the rest of the paper, we use object as shorthand for a nonempty subset in a partition of sources. We denote each source's directional uncertainty as σ_ic. Under this framework, we seek to find the partition with maximum likelihood.

The likelihood of a partition ${ \mathcal P }$ is the product of conditionally independent likelihoods

$\begin{eqnarray}&&{ \mathcal L }({ \mathcal P })=p({{\boldsymbol{D}}}_{11},\ldots ,{{\boldsymbol{D}}}_{{ic}},\ldots ,{{\boldsymbol{D}}}_{{N}_{c},C}| { \mathcal P })=\displaystyle \prod _{o\in O}{{ \mathcal M }}_{o},\end{eqnarray} \tag{ 2 }$

which we aim to maximize. Here, ${{ \mathcal M }}_{o}$ is the marginal likelihood of an association corresponding to an object o and is calculated via

$\begin{eqnarray}&&{{ \mathcal M }}_{o}=\int {\rho }_{{C}_{o}}(\omega )\displaystyle \prod _{{ic}\in {S}_{o}}{{\ell }}_{{ic}}(\omega )\,d\omega ,\end{eqnarray} \tag{ 3 }$

where C_o:={c: ic ∈ S_o} is the subset of catalogs associated with object o, and ${\rho }_{{C}_{o}}(\omega )$ is the prior probability density function of the object direction creating the sources within each catalog of C_o. For convenience, we also define the marginal likelihood of a nonassociation for an object's sources as

$\begin{eqnarray}&&{{ \mathcal M }}_{o}^{\mathrm{NA}}=\displaystyle \prod _{{ic}\in {S}_{o}}\int {\rho }_{c}(\omega ){{\ell }}_{{ic}}(\omega )\,d\omega ,\end{eqnarray} \tag{ 4 }$

where ρ_c(ω) is the prior probability density function of the sources in catalog c. For a detailed discussion on this notation and its assumptions, see Nguyen et al. (2022).

For an object in a partition, the Bayes factor is the ratio of the marginal likelihood of the object association and the marginal likelihood of the nonassociation of the sources in the object:

$\begin{eqnarray}&&{B}_{o}=\displaystyle \frac{{{ \mathcal M }}_{o}}{{{ \mathcal M }}_{o}^{\mathrm{NA}}}.\end{eqnarray} \tag{ 5 }$

When the Bayes factor is greater than 1, the hypothesis for association is more likely than the hypothesis for nonassociation. When the Bayes factor is less than 1, the hypothesis for nonassociation is more likely than the hypothesis for association. We note that $\prod {{ \mathcal M }}_{o}^{\mathrm{NA}}$ is independent of the proposed partition, i.e., constant, hence optimizing $\prod {{ \mathcal M }}_{o}$ is equivalent to maximizing ∏B_o.

We summarize a source's imaging data, D_ic, with its measured direction and directional uncertainty. To model the uncertainty in the measured directions, we use the simplest spherical analog of the normal distribution known as the Fisher (1953) distribution:

$\begin{eqnarray}&&{{\ell }}_{{ic}}(\omega ):=f({x}_{{ic}};{\omega }_{{ic}},{\kappa }_{{ic}})=\displaystyle \frac{{\kappa }_{{ic}}}{4\pi \sinh ({\kappa }_{{ic}})}\exp ({\kappa }_{{ic}}\omega \cdot {x}_{{ic}}),\end{eqnarray} \tag{ 6 }$

where κ_ic is a concentration parameter. When κ_ic ≫ 1, the Fisher distribution is approximately equal to a two-dimensional normal distribution in the tangent plane with variance ${\sigma }_{{ic}}^{2}$ where ${\kappa }_{{ic}}=1/{\sigma }_{{ic}}^{2}$ . We denote κ_ic in arcseconds⁻² as ${\kappa }_{{ic}}^{\sec }$ and denote κ_ic in radians⁻² without the superscript. Let ${ \mathcal R }\,={(3600\cdot 180)}^{2}/{\pi }^{2}\,{\mathrm{arcseconds}}^{2}$ to convert from arcseconds⁻² to rad⁻². We denote any κ in arcseconds⁻² as ${\kappa }^{\sec }$ and any κ in radians⁻² without the superscript. So, we have

$\begin{eqnarray}&&\kappa ={ \mathcal R }{\kappa }^{\sec }.\end{eqnarray} \tag{ 7 }$

Using this flat-sky approximation, Budavári & Szalay (2008) showed that an object's Bayes factor can be written as

$\begin{eqnarray}&&{B}_{o}={2}^{| {S}_{o}| -1}\displaystyle \frac{{\prod }_{{ic}}{\kappa }_{{ic}}}{{\sum }_{{ic}}{\kappa }_{{ic}}}\exp \left(-\displaystyle \frac{{\sum }_{{ic}}{\sum }_{i^{\prime} c^{\prime} }{\kappa }_{{ic}}^{\sec }{\kappa }_{i^{\prime} c^{\prime} }^{\sec }{\psi }_{{ic},i^{\prime} c^{\prime} }^{2}}{4{\sum }_{{ic}}{\kappa }_{{ic}}^{\sec }}\right),\end{eqnarray} \tag{ 8 }$

where ∣S_o∣ is the number of sources in an object and ${\psi }_{{ic},i^{\prime} c^{\prime} }$ is the squared distance between sources ic and $i^{\prime} c^{\prime}$ . Putting this together, our objective is to maximize

$\begin{eqnarray}&&\displaystyle \prod _{o\in O}{B}_{o}\end{eqnarray} \tag{ 9 }$

over the space of feasible partitions. Equivalently, we can minimize the negative log of the Bayes factors product

$\begin{eqnarray}&&\displaystyle \sum _{o\in O}-\mathrm{ln}({B}_{o}),\end{eqnarray} \tag{ 10 }$

where

$\begin{eqnarray}&&\begin{array}{l}-\mathrm{ln}({B}_{o})=(1-| {S}_{o}| )\mathrm{ln}(2)-\displaystyle \sum _{{ic}}\mathrm{ln}({\kappa }_{{ic}})\\ +\mathrm{ln}\left(\displaystyle \sum _{{ic}}({\kappa }_{{ic}})\right)+\displaystyle \frac{{\sum }_{{ic}}{\sum }_{i^{\prime} c^{\prime} }{\kappa }_{{ic}}^{\sec }{\kappa }_{i^{\prime} c^{\prime} }^{\sec }{\psi }_{{ic},i^{\prime} c^{\prime} }^{2}}{4{\sum }_{{ic}}{\kappa }_{{ic}}^{\sec }}.\end{array}\end{eqnarray} \tag{ 11 }$

Optimizing the log Bayes factor instead of the Bayes factor itself allows us to optimize over a sum instead of a product. Optimizing over a sum allows the use of fast mathematical programming methods.

2.2. Prior State of the Art

Budavári & Lubow (2012) proposed optimizing the Bayes factor by creating a fully connected graph between sources and removing the largest edge until the (logarithm) Bayes factor was optimal. Their Chainbreaker method is essentially equivalent to single-linkage clustering. However, Chainbreaker provides only an approximate solution and does not enforce the constraint that two sources from the same catalog are not associated with each other. For two catalogs, Budavári & Basu (2016) solved the problem in polynomial time using the Hungarian algorithm. However, for three or more catalogs, the problem becomes NP-Hard. Shi et al. (2019) formulated the problem using mixed integer linear programming. However, their formulation enumerated all subsets of of the source data set, which grows exponentially fast with the number of catalogs and sources. We denote the data set of source-catalog pairs as D. Nguyen et al. (2022) reformulated the problem by assigning sources to hypothetical objects. This reformulation reduced the size of the formulation from exponential to polynomial. However, their method was still not able to process more than 20 catalogs in a reasonable amount of time. Following the naming convention in Nguyen et al. (2022), we refer to the method by Shi et al. (2019) as CanILP and the method by Nguyen et al. (2022) as DirILP. We build upon ideas in the DirILP formulations, hence we describe the relevant components next.

2.3. The DirILP Solver

We start with a concise review of DirILP and its parts relevant to our improved method; see Appendix in Nguyen et al. (2022) for a full description. DirILP directly assigns sources to hypothetical objects rather than enumerating all possible subsets like in CanILP. Let N be the maximum number of objects. They introduced binary variables x_ic ^o for each source-catalog pair ic ∈ D and hypothetical object o ∈ {1,..., N} where ${x}_{{ic}}^{o}=1$ if ic is associated with object o and 0 otherwise. So, a subset S_o in a partition ${ \mathcal P }$ can be represented as all sources where ${x}_{{ic}}^{o}=1$ . To capture the number of sources assigned to an object, for each k ∈ {0,...,C}, they introduced binary variables

$\begin{eqnarray}&&{z}_{k}^{o}=\left\{\begin{array}{l}1\quad \mathrm{if}\,\displaystyle \sum _{{ic}}{x}_{{ic}}^{o}=k\\ 0\quad \mathrm{else}\end{array}\right..\end{eqnarray} \tag{ 12 }$

DirILP also introduced binary variables

$\begin{eqnarray}&&{y}_{{ic},i^{\prime} c^{\prime} }^{o}=\left\{\begin{array}{l}1\quad \mathrm{if}\,{x}_{{ic}}^{o}={x}_{i^{\prime} c^{\prime} }^{o}=1\\ 0\quad \mathrm{else}\end{array}\right.,\end{eqnarray} \tag{ 13 }$

which indicates if two sources from different catalogs c and $c^{\prime}$ are assigned to the same object.

To linearize the term $\mathrm{ln}\left({\sum }_{{ic}}{\kappa }_{{ic}}\right)$ in the negative log Bayes factor formula, Nguyen et al. created a piecewise linear approximation. First, they introduced constant terms ${b}_{\min }:={b}_{1},{b}_{2},{b}_{3},\ldots ,{b}_{R}=: {b}_{\max }$ where

$\begin{eqnarray}&&\begin{array}{l}{b}_{\min }=\mathrm{ln}\left(\mathop{\min }\limits_{{ic}\in D}{\kappa }_{{ic}}\right),\\ {b}_{\max }=\mathrm{ln}\left(C\mathop{\max }\limits_{{ic}\in D}{\kappa }_{{ic}}\right).\end{array}\end{eqnarray} \tag{ 14 }$

Next, they set an error threshold and set

$\begin{eqnarray}&&R:=\,\lceil \displaystyle \frac{{b}_{\max }-{b}_{\min }}{\epsilon }\rceil ,\end{eqnarray} \tag{ 15 }$

and the list of constants are defined for p = 1, ..., R as

$\begin{eqnarray}&&{b}_{p}={b}_{\min }+(p-1)\epsilon .\end{eqnarray} \tag{ 16 }$

Next, they defined for each object o binary variables ${\chi }_{1}^{o}\geqslant {\chi }_{2}^{o}\,\geqslant ...\geqslant \,{{\boldsymbol{\chi }}}_{R}^{o}$ and imposed a new constraint

$\begin{eqnarray}&&{\chi }_{1}^{o}\exp ({b}_{1})+\displaystyle \sum _{p=2}^{R}{\chi }_{p}^{o}\left[\exp ({b}_{p})-\exp ({b}_{p-1})\right]\geqslant \displaystyle \sum _{{ic}}{\kappa }_{{ic}}{x}_{{ic}}^{o}.\end{eqnarray} \tag{ 17 }$

So, the new variables provide an approximation to the original term:

$\begin{eqnarray}\begin{array}{rcl}\mathrm{ln}\left(\displaystyle \sum _{{ic}}{\kappa }_{{ic}}\right) & \approx & {\chi }_{1}^{o}{b}_{1}+\displaystyle \sum _{p=2}^{R}{\chi }_{p}^{o}({b}_{p}-{b}_{p-1})\\ & = & {\chi }_{1}^{o}{b}_{\min }+\epsilon \displaystyle \sum _{p=2}^{R}{\chi }_{p}^{o}.\end{array}\end{eqnarray} \tag{ 18 }$

For an example of how these variables work see Section 2.2.2 in Nguyen et al. (2022). To linearize the $(1-| {S}_{o}| )\mathrm{ln}(2)$ term in the objective function, they introduced new variables p^o for each object where

$\begin{eqnarray}{p}^{o}=\left\{\begin{array}{ll}(1-k)\mathrm{ln}(2)\quad & \mathrm{if}\,{z}_{k}^{o}=1\,\mathrm{for}\ \mathrm{some}\,k\in [C]\\ 0\quad & \mathrm{else}\end{array}\right..\end{eqnarray} \tag{ 19 }$

Finally, to linearize the double sum in the objective function

$\begin{eqnarray}&&\displaystyle \frac{{\sum }_{{ic}}{\sum }_{i^{\prime} c^{\prime} }{\kappa }_{{ic}}^{\sec }{\kappa }_{i^{\prime} c^{\prime} }^{\sec }{\psi }_{{ic},i^{\prime} c^{\prime} }^{2}}{4{\sum }_{{ic}}{\kappa }_{{ic}}^{\sec }},\end{eqnarray} \tag{ 20 }$

they introduced new constants

$\begin{eqnarray}&&\begin{array}{l}{c}_{\min }:=\mathop{\min }\limits_{{ic}\in D}{\kappa }_{{ic}}^{\sec },\\ {c}_{\max }:=C\mathop{\max }\limits_{{ic}\in D}{\kappa }_{{ic}}^{\sec },\end{array}\end{eqnarray} \tag{ 21 }$

and grid points 0:=c₀, c₁,...,c_Q, where c₁ is c_min rounded to the nearest 100 and c_Q is c_max rounded to the nearest 100. Then, ∀ i > 2, c_i := c₁ + 100(i − 1). Then they introduced new variables u_k ^o for each object where

$\begin{eqnarray}{u}_{k}^{o}=\left\{\begin{array}{ll}1\quad & \mathrm{if}\,\displaystyle \sum _{{ic}}[{\kappa }_{{ic}}^{\sec }]{x}_{{ic}}^{o}={c}_{k}\\ 0\quad & \mathrm{else}\end{array}\right.,\end{eqnarray} \tag{ 22 }$

where k ∈ {0, 1,...,Q}. u_k ^o approximates ${\sum }_{{ic}\in {S}_{o}}{\kappa }_{{ic}}^{\sec }$ in the denominator of the double sum. Then, they defined the variables t^o as

$\begin{eqnarray}{t}^{o}=\left\{\begin{array}{ll}\tfrac{{\sum }_{{ic}}{\sum }_{i^{\prime} c^{\prime} }{\kappa }_{{ic}}^{\sec }{\kappa }_{i^{\prime} c^{\prime} }^{\sec }{\psi }_{{ic},i^{\prime} c^{\prime} }^{2}{y}_{{ic},i^{\prime} c^{\prime} }^{o}}{4{c}_{k}}\quad & \mathrm{if}\,{u}_{k}^{o}=1\,\mathrm{for}\ \mathrm{some}\,k\in \{1,2,\ldots ,Q\}\\ 0\quad & \mathrm{else}.\end{array}\right.\end{eqnarray} \tag{ 23 }$

For details on how close this approximation is to the actual term, see Nguyen et al. (2022). In terms of the new variables in DirILP, the objective function is:

$\begin{eqnarray}&&\displaystyle \sum _{o}\left({p}^{o}-\displaystyle \sum _{{ic}}{x}_{{ic}}^{o}\mathrm{ln}({\kappa }_{{ic}})+{\chi }_{1}^{o}{b}_{\min }+\epsilon \displaystyle \sum _{p=2}^{R}{\chi }_{p}^{o}+{t}^{o}\right),\end{eqnarray} \tag{ 24 }$

which is linear in all the new variables. Nguyen et al. used Gurobi (Gurobi Optimization, LLC 2023) to solve the mixed integer linear program. However, even setting up the problem takes a long time because the t^o variables involve $\displaystyle \left(\genfrac{}{}{0em}{}{| D| }{2}\right)$ terms where ∣D∣ is the number of sources-catalog pairs in the data set D.

3. Our Approach Using MIQCP

We improve on DirILP by introducing quadratic constraints and removing some terms from the objective function that are constant across any partition. We also propose a method to provably bound N, the maximum number of objects, which allows far fewer optimization variables to be created. A naive bound on the maximum number of objects is the total number of sources, which was used in DirILP. Our method to bound N method is described in the next section.

First, the objective function of our problem is

$\begin{eqnarray}&&\begin{array}{l}-\displaystyle \sum _{o}\mathrm{ln}({B}_{o})=\displaystyle \sum _{o}\left(1-| {S}_{o}| )\mathrm{ln}(2)-\displaystyle \sum _{{ic}}\mathrm{ln}({\kappa }_{{ic}})\right.\\ +\,\left.\mathrm{ln}\left(\displaystyle \sum _{{ic}}({\kappa }_{{ic}})\right)+\displaystyle \frac{{\sum }_{{ic}}{\sum }_{i^{\prime} c^{\prime} }{\kappa }_{{ic}}^{\sec }{\kappa }_{i^{\prime} c^{\prime} }^{\sec }{\psi }_{{ic},i^{\prime} c^{\prime} }^{2}}{4{\sum }_{{ic}}{\kappa }_{{ic}}^{\sec }}\right).\end{array}\end{eqnarray} \tag{ 25 }$

DirILP approximated the first term through the p^o variables. However, looking at the sum of the first term we see that

$\begin{eqnarray}&&\begin{array}{l}\displaystyle \sum _{o}(1-| {S}_{o}| )\mathrm{ln}(2)=\displaystyle \sum _{o}\mathrm{ln}(2)-\displaystyle \sum _{o}(| {S}_{o}| )\mathrm{ln}(2)\\ =\,K\mathrm{ln}(2)-| D| \mathrm{ln}(2),\end{array}\end{eqnarray} \tag{ 26 }$

where K is the number of nonempty objects in the solution and ∣D∣ is the total number of source-catalog pairs in the data set D. The latter term does not change between different partitions. So, it can be removed from the optimization problem. We can also see that

$\begin{eqnarray}&&\displaystyle \sum _{o}\displaystyle \sum _{{ic}\in {S}_{o}}\mathrm{ln}({\kappa }_{{ic}})=\displaystyle \sum _{{ic}\in D}\mathrm{ln}({\kappa }_{{ic}}).\end{eqnarray} \tag{ 27 }$

So, this term can also be removed from the optimization problem.

We introduce new binary variables q^o where

$\begin{eqnarray}&&{q}^{o}\geqslant {x}_{{ic}}^{o}\quad \forall {ic}\in {S}_{o},\end{eqnarray} \tag{ 28 }$

so that the number of nonempty objects is given by ∑_o q_o: since we are minimizing, in the optimal solution q^o will equal 0 when all ${x}_{{ic}}^{o}=0$ and 1 if any ${x}_{{ic}}^{o}=1$ for an object o.

We introduce quadratic constraints in order to improve t^o in DirILP. Unlike DirILP, our new method is exact for the double-sum term. Our formulation is based on Werner (2022), which uses mixed integer quadratically constrained programming (MIQCP) to solve the k-means clustering problem. As shown in Appendix, minimizing the weighted sum of squared distances is equivalent to minimizing the sum of squared distances to a subset's centroid. Let μ^o be a continuous vector with the same dimension as the source positions in the data set. Let r_ic be nonnegative continuous variables. Then, we can minimize the sum of the squared distances to a subset's centroid through the following program

$\begin{eqnarray}&&\begin{array}{l}\min \quad \displaystyle \sum _{{ic}}{r}_{{ic}}\\ {\rm{s.t.}}\quad I.\,{\kappa }_{{ic}}^{\sec }| | {x}_{{ic}}-{\mu }^{o}| {| }_{2}^{2}\leqslant {r}_{{ic}}+M(1-{x}_{{ic}}^{o})\\ \forall {ic}\in D,o\in [N]\\ {II}.\quad \displaystyle \sum _{o}{x}_{{ic}}^{o}=1\quad \forall {ic}\in D\\ {III}.\quad \displaystyle \sum _{i}{x}_{{ic}}^{o}\leqslant 1\quad \forall c\in [C],o\in [N],\end{array}\end{eqnarray} \tag{ 29 }$

where M is chosen large enough. Here, M can be chosen as the largest weighted pairwise distance between any two points in the data set. Constraint I, which is a convex quadratic constraint, makes sure that r_ic is at least the weighted squared distance between a source and the centroid of its assigned subset. Since we are minimizing, r_ic will be equal to the weighted squared distance. Constraint II ensures that all sources are assigned to an object. Constraint III ensures no two sources from the same catalog are assigned to the same object. With these new variables, we find that:

$\begin{eqnarray}&&\begin{array}{l}\displaystyle \sum _{o}\left(\displaystyle \frac{{\sum }_{{ic}}{\sum }_{i^{\prime} c^{\prime} }{\kappa }_{{ic}}^{\sec }{\kappa }_{i^{\prime} c^{\prime} }^{\sec }{\psi }_{{ic},i^{\prime} c^{\prime} }^{2}}{4{\sum }_{{ic}}{\kappa }_{{ic}}^{\sec }}\right)=\displaystyle \frac{1}{2}\displaystyle \sum _{{ic}}{r}_{{ic}}.\end{array}\end{eqnarray} \tag{ 30 }$

We use the same approximation as DirILP for the $\mathrm{ln}(\sum {\kappa }_{{ic}})$ term. Putting all of this together, we get that our final mixed integer quadratically constrained program is

$\begin{eqnarray}\begin{array}{ll}\min \quad & \displaystyle \frac{1}{2}\displaystyle \sum _{{ic}}{r}_{{ic}}+\displaystyle \sum _{o}\left(\Space{0ex}{3.55ex}{0ex}{q}^{o}\mathrm{ln}(2)+{\chi }_{1}^{o}{b}_{\min }\right.\\ & +\left.\displaystyle \sum _{{ic}\in {S}_{o}}{\chi }_{{ic}}^{o}\mathrm{ln}({\kappa }_{{ic}}\right)\\ {\rm{s.t.}}\quad & {\rm{Constraints}}\,{\rm{I,}}\,{\rm{II,}}\,{\rm{and}}\,{\rm{III}}\,{\rm{from}}\,{\rm{Equation}}\,{\rm{(29).}}\\ & {q}^{o}\geqslant {x}_{{ic}}^{o}\quad \forall {ic}\in {S}_{o}\\ & {x}_{{ic}}^{o}\in \{0,1\}\quad \forall {ic}\in D,o\in [N]\\ & {\chi }_{p}^{o}\in \{0,1\}\quad \forall o\in [N],p\in [R]\\ & {\chi }_{1}^{o}\geqslant {\chi }_{2}^{o}\,\geqslant ...\geqslant \,{\chi }_{R}^{o}\quad \forall o\in [N]\\ & {\chi }_{1}^{o}{e}^{{b}_{1}}+\displaystyle \sum _{p=2}^{R}{\chi }_{p}^{o}\left({e}^{{b}_{p}}-{e}^{{b}_{p-1}}\right)\\ & \geqslant \displaystyle \sum _{{ic}}{\kappa }_{{ic}}{x}_{{ic}}^{o}\quad \forall o\in [N].\end{array}\end{eqnarray} \tag{ 31 }$

Using quadratic constraints, we are able to capture the double-sum term in O(∣D∣) terms rather than O(∣D∣²) terms in DirILP. This dramatically reduces the time to set up the program since ∣D∣ grows quickly with the number of catalogs. This change brings the problem into a more complex class of mixed integer problems with different optimization algorithms. However, Gurobi has improved their MIQCP solver a lot in the past several years, and our results in Section 5 show that their MIQCP solver is fast.

4. Numerical Accuracy and Bounding Improvements

We propose two improvements that allowed us to further speed up the solver by multiple orders of magnitude on top of the MIQCP formulation. These improvements allow us to both reduce the number of optimization variables needed and reduce the magnitude of numbers we input into the integer programming solver. Integer programming solvers perform much better with fewer variables and with small-magnitude numbers.

4.1. Numerical Accuracy Improvements

For the simulated data, the uncertainties are very small when converted to radians (≤10⁻⁵). So, the κ_ic terms are very large (>10¹⁰). This poses a numerical problem for the κ_ic in rad⁻², particularly for the $\mathrm{ln}({\kappa }_{{ic}})$ terms. However, we can avoid this since we are taking the natural logarithm of all of our terms.

We convert between ${\kappa }^{\sec }$ and rad⁻² using Equation (7). Converting all the ${\kappa }^{\sec }$ terms to rad⁻² prior to taking the natural logarithm leads to numerical errors. This calculation requires floating-point arithmetic, which suffers from numerical errors for large numbers. These errors are small relative to the large numbers. However, since we are reducing the numerical magnitude by taking the logarithm of these large numbers, the relative errors are magnified. So, we instead can use the following identities:

$\begin{eqnarray}&&\displaystyle \sum _{{ic}\in {S}_{o}}\mathrm{ln}({ \mathcal R }{\kappa }_{{ic}}^{\sec })=| {S}_{o}| \mathrm{ln}({ \mathcal R })+\displaystyle \sum _{{ic}\in {S}_{o}}\mathrm{ln}({\kappa }_{{ic}}^{\sec }),\end{eqnarray} \tag{ 32 }$

$\begin{eqnarray}&&\mathrm{ln}\left(\displaystyle \sum _{{ic}\in {S}_{o}}{ \mathcal R }{\kappa }_{{ic}}^{\sec }\right)=\mathrm{ln}({ \mathcal R })+\mathrm{ln}\left(\displaystyle \sum _{{ic}\in {S}_{o}}{\kappa }_{{ic}}^{\sec }\right).\end{eqnarray} \tag{ 33 }$

By taking the logarithm of our large numbers before multiplication, we can reduce the relative error in the final term. Equation (32) is only useful in calculating the negative logarithm of the Bayes factor and not the optimization since it can be removed from the optimization problem as shown in Section 3. Equation (33) is useful in the optimization to reduce the magnitude of numbers going into the solver. We can handle the $\mathrm{ln}({ \mathcal R })$ term since it is constant and only affects the optimization when an object is nonempty. So, in the final optimization, we optimize over ${\sum }_{o}{q}^{o}\mathrm{ln}({ \mathcal R })$ since q^o is only nonzero if the subset for object o is nonempty.

4.2. Bounding Algorithm

Here, we present an algorithm to provide a better upper bound on the maximum number of nonempty objects, denoted N. We show empirically in Section 5 that the bounding algorithm dramatically decreases the number of decision variables and the overall runtime of the MIQCP solver. To give an upper bound on N, we use a good feasible solution with objective function value $\hat{B}$ . This good feasible solution can be found by stopping the MIQCP method early. We then use Algorithm 1 to find that upper bound. Solving a relaxed MIQCP problem provides a lower bound on the optimal objective function value for the original MIQCP problem. So, by iteratively restricting the minimum number of nonempty objects, we can provide an upper bound on N, which we denote $\hat{N}$ .

Algorithm 1. An algorithm to find an upper bound on the maximum number of hypothetical objects.

1: $\hat{N}\to 0$ $\hat{N}\to 0$ .

2:While $\hat{N}\lt N$ $\hat{N}\lt N$ do.

3: Create an instance of the MIQCP model.

4: Add a constraint ${\sum }_{o}{q}^{o}\geqslant \hat{N}$ ${\sum }_{o}{q}^{o}\geqslant \hat{N}$ .

5: Relax the MIQCP model by removing binary and integer constraints.

6: Solve the relaxed problem to get a lower bound on the new problem's optimal objective function value, L.

7: If $L\gt \hat{B}$ $L\gt \hat{B}$ then

8: return $\hat{N}$ $\hat{N}$ ,

9: else

10: $\hat{N}\to \hat{N}+1$ $\hat{N}\to \hat{N}+1$ .

11: Return $\hat{N}$ $\hat{N}$ .

Download table as: ASCII Typeset image

As seen in our MIQCP formulation, N is present whenever we sum over all of the hypothetical objects. So, reducing N can dramatically reduce the number of variables we have and allow Gurobi to solve our problem much faster. This runtime improvement is shown for the simulated data in the next section.

5. Results and Performance on Simulated Data

We tested the new MIQCP solver on simulated catalogs. We used Gurobi (Gurobi Optimization, LLC 2023) to set up and solve DirILP and our MIQCP method. All of our code is available on GitHub (Feitelberg 2023). The computer we run our tests on has an Intel® Core^TM i9-9900K CPU and 32 GB of RAM.

We create the synthetic astronomy catalogs as described by Nguyen et al. (2022). Using the flat-sky approximation appropriate for the typical large compactness parameters κ = 1/σ² of modern surveys, i.e., small directional uncertainties, we simulate two objects by drawing sources from multiple Gaussian distributions. These distributions represent the directions of two objects. For each simulated object, we draw a source from its distribution. In practice, we can first separate sources that have a large angular separation, in relation to their uncertainties, into connected islands to allow for parallel processing of smaller optimization problems.

Nguyen et al. (2022) used DBSCAN in Scikit-Learn to isolate islands. For our simulations, we skip this step and only simulate two objects. We set the distance between the objects to be close enough that the two boundaries of samples drawn from the distributions overlap. Nguyen et al. also set the MIP Gap (optimality gap) to 0.5% to reduce the runtime to prove optimality. However, for our simulations, we keep the MIP Gap at the default value of 0.01%. So, our results are closer to optimal. We also do not enforce any of the heuristics in Section 4 of Nguyen et al. (2022).

Figure 1 illustrates the performance of the new method compared to Chainbreaker for a simulation with 20 catalogs. MIQCP is able to recover the true solution whereas Chainbreaker predicts the wrong number of objects and provides a worse solution than MIQCP. This is because Chainbreaker only considers removing the largest edge in the pairwise distance graph, but not all other combinations of associations. If the largest edge removal does not lead to a better Bayes factor, then Chainbreaker stops searching.

Our results in Figure 2 show that for more than 5 catalogs, our MIQCP method is much faster than DirILP. As shown by the error bars, the runtime can vary a lot for a larger number of catalogs for both DirILP and our MIQCP method. This behavior is common with integer programming solvers. As shown in Figure 3, the MIQCP setup is much faster than DirILP. Our results in Figure 4 also show while DirILP cannot process more than 25 catalogs in a reasonable amount of time, our MIQCP method can solve up to 35 catalogs. For these results, we set a reasonable amount of time to 45 minutes.

farcs — **Figure 2.** Simulation Total Runtime vs. Number of Catalogs for DirILP and MIQCP. We set the distance between the object centers to 013. We set the uncertainties for both objects to 004. For these runtimes, we use the algorithm to find a better upper bound on the maximum number of hypothetical objects. We repeat the simulations five times for each setting. MIQCP has a faster total runtime than DirILP.The DirILP line does not have data points for 30 and 35 catalogs because we stopped the program after 45 minutes. We show in Figure 4 exactly what percentage of the runs succeeded in finishing before 45 minutes. The error bars at the end of DirILP and MIQCP are zero because only 1 out of 5 runs finished on time for both methods.
Download figure:
Standard image High-resolution image

**Figure 3.** Setup Time vs. Number of Catalogs for DirILP and MIQCP. We set the distance between the object centers to 013. We set the uncertainties for both objects to 004. For these runtimes, we do not use the algorithm to find a better upper bound on the maximum number of hypothetical objects. We repeat the simulations 5 times for each setting. MIQCP has a faster setup time than DirILP.
Download figure:
Standard image High-resolution image

**Figure 4.** Percent of Optimization Problems Solved vs. Number of Catalogs for DirILP and MIQCP. For these results, we set a reasonable amount of time to 45 minutes. We set the distance between the object centers to 013. We set the uncertainties for both objects to 004. For these runtimes, we use the algorithm to find a better upper bound on the maximum number of hypothetical objects. We repeat the simulations five times for each setting. MIQCP is able to solve more problems within 45 minutes than DirILP.
Download figure:
Standard image High-resolution image

Finally, to demonstrate the importance of finding a good upper bound on the maximum number of hypothetical objects, we manually set the bound to 2 and ran our MIQCP method. As seen in Figure 5, the runtime is reduced by multiple orders of magnitude, and we are able to provide globally optimal solutions in under 20 minutes even for 80 catalogs.

**Figure 5.** MIQCP Runtime vs. Number of Catalogs with $\hat{N}=2$ . We manually set the maximum number of hypothetical objects to 2. We set the distance between the object centers to 013. We set the uncertainties for both objects to 004. We repeat the simulations five times for each setting. Comparing to the runtimes in Figure 2, reducing $\hat{N}$ dramatically reduces the overall runtime.
Download figure:
Standard image High-resolution image

**Figure 5.** MIQCP Runtime vs. Number of Catalogs with $\hat{N}=2$ . We manually set the maximum number of hypothetical objects to 2. We set the distance between the object centers to 013. We set the uncertainties for both objects to 004. We repeat the simulations five times for each setting. Comparing to the runtimes in Figure 2, reducing $\hat{N}$ dramatically reduces the overall runtime.
Download figure:
Standard image High-resolution image

6. Results on Real-world Data

We also tested our new method on real-world data from the Hubble Source Catalog (HSC; Whitmore et al. 2016).⁴ The only other previous algorithm tested on this data was the chainbreaker algorithm Budavári & Lubow (2012). The HSC is built from the Hubble Legacy Archive (HLA) source lists. The HLA contains extracted sources in white-light images that combines photons across multiple photometric bands to increase the signal-to-noise ratio before detection. The data can be found at the HSC website.⁵

A match is a collection of sources that is far away from other sources, equivalent to an island in Nguyen et al. (2022). Each match was created using the original chainbreaker method on the entire data set with a large distance threshold. We found that the real-world data was dissimilar to our simulated catalogs. As shown in Figure 7, real-world sources do not always follow the circular structure like in the simulated data. Also, during simulation, we always drew 2 samples from each object's distribution for each catalog. In real-world data though, some objects only show up in a subset of all the catalogs. This is apparent in the match shown in Figure 6 where each catalog only has 1 source despite there being at least three clearly defined objects.

**Figure 7.** HSC match with 4 catalogs. In the first plot, we color each point according to its catalog. We also shade the circle of radius σ around the source center. In the second plot, we use pairwise lines to show which points are assigned to the same object by the Chainbreaker method. MIQCP and DirILP could not find a solution for this match within 45 minutes.
Download figure:
Standard image High-resolution image

For this section, we chose matches that were small enough to run our algorithms on. As shown in the match in Figure 6, DirILP was not able to process even a small real-world match. What is also seen in the real-world data is that there are many other variables that affect the runtime of any integer programming method such as the values of κ_ic, how many total sources there are, and the distances between points. However, some matches have hundreds or thousands of sources in them, which is outside the capabilities of our new method and any prior exact method.

7. Conclusion

In this paper, we introduced a novel approach to globally optimal catalog matching and have demonstrated significant improvements in the performance of finding provably optimal catalogs. Our new MIQCP method is a modification of DirILP that introduces quadratic constraints to shrink the size of the problem dramatically, which reduces the runtime by multiple orders of magnitude. While no exact method will be able to overcome the exponential runtime inherent to the combinatorial nature of the matching problem, our new method allows us to set up and solve the problem a fraction of the time as DirILP, which makes our MIQCP method a clear choice in all cases. Future improvements to the MIQCP method might include better approximations for the log-sum part of the DirILP formulation, which we did not modify here. In the immediate future, we plan on testing new approximate algorithms on more real-world data from the HSC to process matches that have hundreds or thousands of sources in them.

Acknowledgments

This paper is based on work supported by the National Science Foundation (NSF), NASA, and the Air Force Office of Scientific Research (AFOSR). T.B. gratefully acknowledges funding from NSF via awards 1909709, 1814778, 2206341, and NASA via STScI-52333 under NAS5-26555. A.B. gratefully acknowledges funding from AFOSR grant FA95502010341 and NSF grant CCF2006587. Also, the authors thank the anonymous referee for the careful review and the thoughtful comments.

Appendix: Equivalence of Weighted Centroid Distance and Pairwise Distances

Here, we provide a proof that the weighted sum of pairwise squared distances is equivalent to the weighted sum of squared distances to cluster centroids. Let ${x}_{i}\in {{\mathbb{R}}}^{d}$ for i = 1,..., n. Each sample x_i has a weight α_i, i = 1,..., n. The weighted mean for a cluster is

$\begin{eqnarray}&&\mu =\displaystyle \frac{{\sum }_{i=1}^{n}{\alpha }_{i}{x}_{i}}{{\sum }_{i=1}^{n}{\alpha }_{i}}.\end{eqnarray} \tag{ A1 }$

Here, we show that

$\begin{eqnarray}&&\displaystyle \sum _{i=1}^{n}\displaystyle \sum _{j=1}^{n}{\alpha }_{i}{\alpha }_{j}| | {x}_{i}-{x}_{j}| {| }^{2}=\displaystyle \sum _{i=1}^{n}{\alpha }_{i}| | {x}_{i}-\mu | {| }^{2},\end{eqnarray} \tag{ A2 }$

as demonstrated below:

$\begin{eqnarray}&&\begin{array}{l}\displaystyle \sum _{i=1}^{n}\displaystyle \sum _{j=1}^{n}{\alpha }_{i}{\alpha }_{j}| | {x}_{i}-{x}_{j}| {| }^{2}\\ =\,\displaystyle \sum _{i=1}^{n}\displaystyle \sum _{j=1}^{n}{\alpha }_{i}{\alpha }_{j}\left(| | {x}_{i}| {| }^{2}-2{x}_{i}^{T}{x}_{j}+| | {x}_{j}| {| }^{2}\right)\\ =\,\left(\displaystyle \sum _{i=1}^{n}\displaystyle \sum _{j=1}^{n}{\alpha }_{i}{\alpha }_{j}| | {x}_{i}| {| }^{2}\right)-2\left(\displaystyle \sum _{i=1}^{n}\displaystyle \sum _{j=1}^{n}{\alpha }_{i}{\alpha }_{j}{x}_{i}^{T}{x}_{j}\right)\\ \quad +\,\left(\displaystyle \sum _{i=1}^{n}\displaystyle \sum _{j=1}^{n}{\alpha }_{i}{\alpha }_{j}| | {x}_{j}| {| }^{2}\right)\\ =\,2\left(\displaystyle \sum _{i=1}^{n}\displaystyle \sum _{j=1}^{n}{\alpha }_{i}{\alpha }_{j}| | {x}_{i}| {| }^{2}\right)-2\left(\displaystyle \sum _{i=1}^{n}\displaystyle \sum _{j=1}^{n}{\alpha }_{i}{\alpha }_{j}{x}_{i}^{T}{x}_{j}\right)\\ =\,2\left(\displaystyle \sum _{i=1}^{n}\displaystyle \sum _{j=1}^{n}{\alpha }_{i}{\alpha }_{j}| | {x}_{i}| {| }^{2}\right)-4\left(\displaystyle \sum _{i=1}^{n}\displaystyle \sum _{j=1}^{n}{\alpha }_{i}{\alpha }_{j}{x}_{i}^{T}{x}_{j}\right)\\ \quad +\,2\left(\displaystyle \sum _{i=1}^{n}\displaystyle \sum _{j=1}^{n}{\alpha }_{i}{\alpha }_{j}{x}_{i}^{T}{x}_{j}\right).\end{array}\end{eqnarray} \tag{ A3 }$

A few intermediate identities:

$\begin{eqnarray}\begin{array}{rcl}| | \mu | {| }^{2} & = & \displaystyle \frac{| | {\sum }_{i=1}^{n}{\alpha }_{i}{x}_{i}| {| }^{2}}{{\left({\sum }_{i=1}^{n}{\alpha }_{i}\right)}^{2}}\\ & = & \displaystyle \frac{{\sum }_{i=1}^{n}{\sum }_{j=1}^{n}{\alpha }_{i}{\alpha }_{j}{x}_{i}^{T}{x}_{j}}{{\left({\sum }_{i=1}^{n}{\alpha }_{i}\right)}^{2}}\\ | | \mu | {| }^{2}{\left(\displaystyle \sum _{i=1}^{n}{\alpha }_{i}\right)}^{2} & = & \displaystyle \sum _{i=1}^{n}\displaystyle \sum _{j=1}^{n}{\alpha }_{i}{\alpha }_{j}{x}_{i}^{T}{x}_{j}.\end{array}\end{eqnarray} \tag{ A4 }$

Next,

$\begin{eqnarray}&&\begin{array}{l}4\displaystyle \sum _{i=1}^{n}\displaystyle \sum _{j=1}^{n}{\alpha }_{i}{\alpha }_{j}{x}_{i}^{T}{x}_{j}=4\displaystyle \sum _{i=1}^{n}\left({\alpha }_{i}{x}_{i}^{T}\displaystyle \sum _{j=1}^{n}{\alpha }_{j}{x}_{j}\right)\\ =\,4\left(\displaystyle \sum _{j}{\alpha }_{j}\right)\displaystyle \sum _{i=1}^{n}\left({\alpha }_{i}{x}_{i}^{T}\displaystyle \frac{\displaystyle \sum _{j=1}^{n}{\alpha }_{j}{x}_{j}}{\displaystyle \sum _{j}{\alpha }_{j}}\right)\\ =\,4\left(\displaystyle \sum _{j}{\alpha }_{j}\right)\displaystyle \sum _{i=1}^{n}{\alpha }_{i}{x}_{i}^{T}\mu .\end{array}\end{eqnarray} \tag{ A5 }$

Putting this together,

$\begin{eqnarray}&&\begin{array}{l}\displaystyle \sum _{i=1}^{n}\displaystyle \sum _{j=1}^{n}{\alpha }_{i}{\alpha }_{j}| | {x}_{i}-{x}_{j}| {| }^{2}\\ =\,2\left(\displaystyle \sum _{j}{\alpha }_{j}\right)\displaystyle \sum _{i=1}^{n}{\alpha }_{i}| | {x}_{i}| {| }^{2}-4\left(\displaystyle \sum _{j}{\alpha }_{j}\right)\displaystyle \sum _{i=1}^{n}{\alpha }_{i}{x}_{i}^{T}\mu \\ +\ 2| | \mu | {| }^{2}{\left(\displaystyle \sum _{i=1}^{n}{\alpha }_{i}\right)}^{2}\\ =\,2\left(\displaystyle \sum _{j}{\alpha }_{j}\right)\displaystyle \sum _{i=1}^{n}{\alpha }_{i}| | {x}_{i}| {| }^{2}\\ -\ 4\left(\displaystyle \sum _{j}{\alpha }_{j}\right)\displaystyle \sum _{i=1}^{n}{\alpha }_{i}{x}_{i}^{T}\mu +2\left(\displaystyle \sum _{j=1}^{n}{\alpha }_{j}\right)\displaystyle \sum _{i=1}^{n}{\alpha }_{i}| | \mu | {| }^{2}\\ =\,2\left(\displaystyle \sum _{j}{\alpha }_{j}\right)\left(\displaystyle \sum _{i=1}^{n}{\alpha }_{i}| | {x}_{i}| {| }^{2}-2{\alpha }_{i}{x}_{i}^{T}\mu +{\alpha }_{i}| | \mu | {| }^{2}\right)\\ =\,2\left(\displaystyle \sum _{j}{\alpha }_{j}\right)\left(\displaystyle \sum _{i=1}^{n}{\alpha }_{i}(| | {x}_{i}| {| }^{2}-2{x}_{i}^{T}\mu +| | \mu | {| }^{2})\right)\\ =\,2\left(\displaystyle \sum _{j}{\alpha }_{j}\right)\left(\displaystyle \sum _{i=1}^{n}{\alpha }_{i}| | {x}_{i}-\mu | {| }^{2}\right)\\ \displaystyle \frac{\displaystyle \sum _{i=1}^{n}\displaystyle \sum _{j=1}^{n}{\alpha }_{i}{\alpha }_{j}| | {x}_{i}-{x}_{j}| {| }^{2}}{2\displaystyle \sum _{j}{\alpha }_{j}}=\displaystyle \sum _{i=1}^{n}{\alpha }_{i}| | {x}_{i}-\mu | {| }^{2}.\end{array}\end{eqnarray} \tag{ A6 }$

So, the sum of the squared distances between points in clusters is equivalent the weighted sum of the squared distance to the centroids.

Fast Globally Optimal Catalog Matching using MIQCP

Article metrics

Author e-mails

Author affiliations

ORCID iDs

Dates

Abstract

1. Introduction

2. Globally Optimal Matching

2.1. Problem Formulation

2.2. Prior State of the Art

2.3. The DirILP Solver

3. Our Approach Using MIQCP

4. Numerical Accuracy and Bounding Improvements

4.1. Numerical Accuracy Improvements

4.2. Bounding Algorithm

5. Results and Performance on Simulated Data

6. Results on Real-world Data

7. Conclusion

Acknowledgments

Appendix: Equivalence of Weighted Centroid Distance and Pairwise Distances

Footnotes

Fast Globally Optimal Catalog Matching using MIQCP

Article metrics

Share this article

Author e-mails

Author affiliations

ORCID iDs

Dates

Abstract

1. Introduction

2. Globally Optimal Matching

2.1. Problem Formulation

2.2. Prior State of the Art

2.3. The DirILP Solver

3. Our Approach Using MIQCP

4. Numerical Accuracy and Bounding Improvements

4.1. Numerical Accuracy Improvements

4.2. Bounding Algorithm

5. Results and Performance on Simulated Data

6. Results on Real-world Data

7. Conclusion

Acknowledgments

Appendix: Equivalence of Weighted Centroid Distance and Pairwise Distances

Footnotes