Interpretable Machine Learning for Finding Intermediate-mass Black Holes

Definitive evidence that globular clusters (GCs) host intermediate-mass black holes (IMBHs) is elusive. Machine-learning (ML) models trained on GC simulations can in principle predict IMBH host candidates based on observable features. This approach has two limitations: first, an accurate ML model is expected to be a black box due to its complexity; second, despite our efforts to simulate GCs realistically, the simulation physics or initial conditions may fail to reflect reality fully. Our training data may therefore be biased, leading to a failure to generalize to observational data. Both the first issue (explainability and interpretability) and the second (out-of-distribution generalization and fairness) are active areas of ML research. Here we employ techniques from these fields to address both: we use the anchors method to explain an Extreme Gradient Boosting (XGBoost) classifier, and we independently train a natively interpretable model using Certifiably Optimal RulE ListS (CORELS). The resulting model has a clear physical meaning, but loses some performance with respect to XGBoost. We evaluate potential candidates in real data based not only on classifier predictions but also on their similarity to the training data, measured by the likelihood under a kernel density estimation model. This quantifies the realism of our simulated data and mitigates the risk that our models produce biased predictions by working in extrapolation. We apply our classifiers to real GCs, obtaining a predicted classification, a measure of the confidence of the prediction, an out-of-distribution flag, a local rule explaining the prediction of XGBoost, and a global rule from CORELS.


INTRODUCTION
The detection of quasars at high redshift (see, e.g., Mortlock et al. 2011; Schindler et al. 2023; Maiolino et al. 2023) requires a mechanism for the rapid assembly of supermassive black holes. Intermediate-mass black holes (IMBHs) bridge the gap between stellar-mass remnants and supermassive black holes, potentially playing an important role as seeds for the latter (see, e.g., Woods et al. 2019 and Volonteri et al. 2021 for two recent reviews). Searches for IMBHs in the present-day Universe are being actively carried out, with a focus on high-density environments such as the Galactic center (Oka et al. 2016; Ballone et al. 2018; Takekawa et al. 2019a,b, 2020; Kaneko et al. 2023; The GRAVITY Collaboration et al. 2023), dwarf galaxies (Mezcua et al. 2016, 2018), and globular clusters (GCs; see, e.g., Maccarone & Servillat 2008; Cseh et al. 2010; Strader et al. 2012; Su et al. 2022; Farrell et al. 2009; Bachetti et al. 2014; Kains et al. 2016; Kızıltan et al. 2017; Lin et al. 2018). Regarding Galactic GCs, the most comprehensive study providing direct IMBH mass upper limits based on radio observations is Tremou et al. (2018), hereafter T18.

Applying machine learning (ML) models to indirect IMBH detection
The current prevailing approach to indirect IMBH detection is to compare a measured physical quantity with its prediction according to a theoretical model of the specific phenomena affected by the presence of an IMBH. These handcrafted physical models clearly present the advantage of full interpretability, but it is (at least in principle) possible that more flexible data-driven models can significantly outperform them. On the other hand, several data-driven approaches to IMBH detection based on machine learning (ML), while promising, have at least two clear drawbacks. First, the general expectation is that a high-performance model must be very complex, virtually becoming a black box, with few notable exceptions. Such a model could be a deep neural network or a decision tree ensemble. These kinds of models are hard to interpret due to their complexity, irrespective of the soundness of the statistical foundations on which they are built. For instance, proving that a neural network is a universal function approximator (Hornik et al. 1989) is scant consolation for the fact that humans can only make sense of the inner workings of a trained neural network through laborious analysis that resembles experimental biology more than mathematics (this reverse-engineering work constitutes the newborn field of 'mechanistic interpretability'; see, e.g., Olah et al. 2017; Carter et al. 2019; Nanda et al. 2023). Second, the model would have to be trained on simulations. Even in the unlikely event that simulations were to correctly model most of the relevant physics, they are quite likely to be run from unrealistic initial conditions, because constraining the early stages of cluster evolution is essentially still an open problem (see, e.g., Torniamenti et al. 2022). The simulated data on which a data-driven model is trained can therefore be biased, containing, e.g., spurious associations between variables or under- or over-representing instances with certain characteristics. In this paper, we address both issues.

Explainability versus native interpretability
We build a baseline black-box model based on XGBoost (Chen & Guestrin 2016). We treat XGBoost as a black box, since it is impossible for a human to make sense of the weighted contributions of thousands of decision trees within a reasonable time frame. Details on XGBoost are given in Section 3. We show that the black-box model behavior can be explained locally using anchors (Ribeiro et al. 2018). Anchors are logical rules that apply in the vicinity of a given instance, explaining the black-box model locally, in the sense that they are faithful to the underlying model in a neighborhood of a selected instance. Concretely, anchors are combinations of a few binary choices obtained by thresholding one or more features (for instance, relaxation time greater than two Gyr and half-mass radius smaller than three pc), which predict the behavior of the underlying classifier (predicting IMBH host versus non-host) on the majority of data points around the one for which we want an explanation. They have been shown empirically to be readily understandable by humans (Ribeiro et al. 2018). Indeed, the anchors we find are often physically meaningful, as we discuss below.
Anchors are one of many schemes recently introduced to explain black-box ML models. Such explanations are necessarily unable to fully capture the behavior of the model they are meant to explain (for instance, anchors are local explanations, and differ based on the data point we seek an explanation for; moreover, sometimes no suitable anchor exists given our requirements). If this were not the case, we could ditch the original model and use the explanations directly for prediction. Following this line of argument, natively interpretable models, i.e. models that are simple enough to be understandable by humans directly, have the advantage of not needing a post-hoc explanation. Rudin (2019), for instance, has argued that, with properly engineered features, supervised problems on tabular data can be tackled effectively by interpretable models without the need to rely on black boxes. We thus train a fully interpretable model based on decision rules, Certifiably Optimal RulE ListS (CORELS; Angelino et al. 2017), to assess how it compares to XGBoost in terms of performance. The CORELS model finds rules that have a straightforward interpretation in terms of GC physics; moreover, the rule list it finds bears some resemblance to the anchor explanations for XGBoost.

Training on simulations, predicting on observations
Acquaviva et al. (2020) provide a systematic discussion of how out-of-distribution generalization may impact the application of ML methods in astronomy; the authors try to measure the increase in generalization error due to training on cosmological simulations and predicting on actual observational data. Their discussion clearly applies to our case, but it focuses mostly on measuring a global distance between the training data and the data set on which the models are eventually deployed, evaluating the effects on overall generalization error. In this paper we instead focus on deciding whether individual predictions should be trusted; we are thus interested in measuring how far an individual data point, corresponding to an actual, observed physical system, is from the simulated training data.

Training, validation and test datasets
To build a training and validation sample for our ML models, and to test them, we used results from the MOCCA-Survey Database I simulations described in detail by Askar et al. (2017b). The MOCCA-Survey Database comprises nearly 2000 star cluster simulations with different initial parameters that were carried out using the MOCCA code (Hypki & Giersz 2013; Giersz et al. 2013), which is based on the Monte Carlo algorithm (Hénon 1971; Stodolkiewicz 1982, 1986) for treating the long-term evolution of star clusters. The orbit-averaged Monte Carlo method combines a statistical approach for the treatment of distant two-body interactions, which drive the dynamical evolution of a star cluster, with the particle-based approach of N-body methods (Giersz 2001; Joshi et al. 2001; Hypki & Giersz 2013; Rodriguez et al. 2022). This approach allows for the implementation of important physical processes and the possibility of simulating star cluster models with several hundred thousand to millions of stars within a few days to weeks. MOCCA incorporates prescriptions for stellar and binary evolution based on the SSE/BSE codes (Hurley et al. 2000, 2002). For computing the outcome of close dynamical interactions between binary-single and binary-binary stars, MOCCA uses the fewbody code (Fregeau et al. 2004), a direct N-body integrator for small-N gravitational dynamics. MOCCA also implements a realistic treatment of escape processes in tidally limited clusters based on Fukushige & Heggie (2000). In order to model the Galactic potential, MOCCA uses a simple point-mass approximation. MOCCA has been comprehensively tested and compared with results from direct N-body codes (see, e.g., Giersz et al. 2013; Heggie 2014; Wang et al. 2016; Madrid et al. 2017; Giersz et al. 2019).
The star cluster models simulated in the MOCCA-Survey Database I span a wide range of initial parameters, with different numbers of objects, metallicities, binary fractions and parameter distributions, central concentrations, tidal and half-mass radii, and different natal kick prescriptions for stellar-mass BHs (see Table 1 in Askar et al. 2017b). Initial stellar masses in each cluster model were sampled using the Kroupa (2001) initial mass function (IMF) with minimum and maximum stellar masses of 0.08 M⊙ and 100.0 M⊙. The initially densest (ρc > 10^6 M⊙ pc^−3) GC models form an IMBH (≥ 500 M⊙) within a few hundred Myr of dynamical evolution through a combination of collisions, mergers, and mass transfer events onto a seed BH. This seed typically forms from a merger between a BH and a very massive main-sequence star (> 50 M⊙), which in turn typically forms via the runaway merger of stars during the early evolution (within 100 Myr) of these dense GC models. The rapid formation of a seed IMBH in these MOCCA models is known as the fast IMBH formation scenario (Giersz et al. 2015).
In other, more moderately dense models (10^4 M⊙ pc^−3 ≲ ρc ≲ 10^6 M⊙ pc^−3), the IMBH forms after a few Gyr of dynamical evolution from mergers during binary interactions involving a stellar-mass BH. This has been referred to as the slow IMBH formation scenario. Both the fast and slow IMBH formation mechanisms in the MOCCA GC models are discussed in detail in Giersz et al. (2015). In most of these simulated cluster models, for mergers involving a BH and a star, it is assumed that the BH accretes the entire mass of the star it merges with (Hong et al. 2020; Askar et al. 2021). Furthermore, gravitational-wave recoil kicks following mergers are also not implemented (Morawski et al. 2018; Maliszewski et al. 2022). Both these assumptions facilitate the formation and growth of IMBHs in these cluster models.
At any rate, any set of simulations aiming at realism will explore a large parameter space that may lead to potentially unphysical outcomes. For instance, stellar evolution parameters such as the amount of mass fallback in the event of a supernova explosion may affect the formation of stellar-mass black holes, leading the simulations to span a wider range of mass-to-light ratios than typical Galactic GCs. However, our knowledge of real star clusters is not complete: recent evidence suggests that exploring a wide range of mass-to-light ratios (e.g. 0.8 to 5.03 in our simulated clusters) may indeed be a feature rather than a bug, given the possibility of, e.g., hypercompact star clusters (Greene et al. 2021). The important safeguard is that the potential lack of realism of the simulations used to train a ML model is taken into account and mitigated appropriately, as we discuss below.
Of the 1298 GC models that evolved up to at least 12 Gyr in the MOCCA-Survey Database I, 388 harboured an IMBH more massive than 500 M⊙; in the following we adopt a more inclusive definition of IMBH, extending the range to 100 M⊙ and above. This choice makes the classification problem harder and is therefore more conservative. Similar to the approach taken in Askar et al. (2019), we used the simulation snapshot at 12 Gyr for each GC model to determine the following features: the Spitzer (2014) half-mass relaxation time corresponding to the half-mass radius (HRT), the total luminosity of the cluster in units of solar luminosity (TVL), the mass-weighted central velocity dispersion (CVD) in units of km/s, the central surface brightness in units of L⊙/pc^2 from King (1962) fitting of the cumulative luminosity profile (CSB), the core radius in pc obtained from King (1962) fitting (OCR), and the half-light radius in pc obtained from King (1962) fitting (OHLR). Fig. 1 shows a pair plot of these features on the adopted dataset from the MOCCA-Survey Database I.
The simulation data were partitioned into a train-validation dataset, on which we used k-fold cross-validation as described below, and an internal test dataset comprising 10% of the initial sample. The split was stratified with respect to the label, i.e. each subset was forced to contain approximately the same proportion of IMBH hosts as the whole sample.

Deployment dataset
After training our models on simulated data, we used them to predict the presence of an IMBH in real GCs. To this end, we use the observational features obtained by Baumgardt & Hilker (2018) for a sample of 161 GCs in the Milky Way as the final dataset on which to deploy our models. The data were obtained from https://people.smp.uq.edu.au/HolgerBaumgardt/globular/parameter.html. Observational data can in principle be out-of-distribution with respect to the simulations comprising our training set. We discuss how we mitigate this issue by identifying out-of-distribution points using Kernel Density Estimation (KDE) in Section 3.4.

K-fold cross-validation
K-fold cross-validation is a widely used technique in ML to evaluate the generalization performance of a model on a given dataset. It involves partitioning a dataset into k subsets, or folds, of approximately equal size. In each iteration, k−1 of the folds are used as training data, and the remaining fold is used as the validation set. This process is repeated k times, with each fold serving as the validation set once. The average performance of the model across all k iterations is then used as an estimate of its performance on unseen data. The major advantage of this technique is that it yields a more robust estimate of the model's performance by utilizing all of the data for both training and validation, thus reducing the risk of overfitting. In this work, we use k = 5 and stratify the sampling so as to have roughly the same fraction of IMBH hosts and non-hosts in each fold.
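The stratified splitting described above can be sketched with scikit-learn's StratifiedKFold; the data here are mock values, not our simulation features:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))              # 200 mock clusters, 6 features
y = (rng.random(200) < 0.3).astype(int)    # ~30% labelled as IMBH hosts

# k = 5, stratified so each fold keeps roughly the global host fraction
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_fractions = []
for train_idx, val_idx in skf.split(X, y):
    fold_fractions.append(y[val_idx].mean())

print(f"{len(fold_fractions)} folds, host fractions: "
      + ", ".join(f"{f:.2f}" for f in fold_fractions))
```

Stratification guarantees that the per-fold counts of hosts differ by at most one, so the fractions stay close to the global one.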

Performance metrics
We evaluate the performance of our classifiers in terms of two metrics: precision and recall. Precision, also referred to as purity in astronomical catalogs, is the fraction of relevant instances (real IMBH hosts) among the retrieved instances (IMBH host claims), while recall, also known as completeness, is the fraction of the total number of relevant instances (real IMBH hosts) that were actually retrieved (claimed to be hosts). If a classifier returns a number between 0 and 1 as a measure of its confidence that a given instance belongs to the designated class (0.642 meaning, for instance, that a given GC is predicted to be an IMBH host with 64.2% probability), a threshold can be chosen to convert that number into a hard prediction. For any given threshold we thus obtain one value of precision and one of recall. Repeating this procedure for different thresholds yields a precision-recall curve, which illustrates the trade-off between precision and recall of a binary classifier as the threshold varies. This curve shows the ability of our classifier to maintain a balance between avoiding false positives and capturing true positives as the classification threshold is varied. If, on the other hand, a classifier returns only a hard classification, this procedure cannot be carried out and only one (precision, recall) pair is calculated.
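The threshold sweep just described is what scikit-learn's precision_recall_curve computes; a minimal sketch on mock confidences (the labels and scores below are illustrative only):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# mock true labels and classifier confidences for eight clusters
y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.642, 0.7])

# each threshold yields one (precision, recall) pair; sweeping traces the curve
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
for p, r, t in zip(precision, recall, thresholds):
    print(f"threshold={t:.3f}  precision={p:.2f}  recall={r:.2f}")
```

As the threshold rises, recall can only decrease, which is the trade-off the curve visualizes.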
It is worth pointing out that the nature of the IMBH detection problem calls for a comparison based mostly on the precision metric. Precision is more important than recall because false positives (IMBH claims in the absence of an actual IMBH) would waste telescope time and other resources needed for an IMBH-candidate follow-up. On the other hand, just one confirmed claim would have momentous astrophysical implications, so false negatives (missed IMBH hosts) are not problematic as long as at least some true positives are found.

XGBoost classifier
We used XGBoost (Extreme Gradient Boosting; Chen & Guestrin 2016), which has found wide application in astronomy (Sesar et al. 2017; Shu et al. 2019; Spina et al. 2021). XGBoost is a scalable and efficient implementation of gradient boosting, an ensemble learning method that combines multiple weak decision trees to form a strong predictor. Decision trees are a supervised learning algorithm. They work by recursively splitting the dataset into smaller subgroups based on the features that lead to the highest improvement in prediction accuracy. The final result is a tree-like model where the internal nodes represent decisions based on the input features, and the leaf nodes represent the prediction or outcome.
The decision trees in XGBoost are grown using a gradient-based optimization algorithm that minimizes a cost function measuring the prediction error. The final prediction is obtained by combining the predictions of all trees through a weighted sum, where the weights are learned during training.
In comparison to random forests, which also use an ensemble of decision trees, XGBoost grows trees sequentially, trying at each step to correct the mistakes of the previous trees. This results in a stronger predictor, but at a higher computational cost (Chen & Guestrin 2016). Humans can effectively keep track of the decisions of only a few shallow decision trees at most (Freitas 2014). Therefore, all tree ensemble methods, unlike single trees, lack interpretability as soon as the number of trees grows enough to justify their use (Molnar 2022), with the lone exception of the recently introduced FIGS algorithm (Tan et al. 2022).
We trained an XGBoost classifier based on an ensemble of 1000 trees of maximum depth 2 on our set of six features. We held out 20% of our simulation data set for validation. Fig. 2 shows a precision-recall curve for our classifier on the test set. As discussed above, being based on a large ensemble of decision trees, our classifier is not readily interpretable: it is a black box. In Fig. 2 we compare its performance, in terms of its precision-recall curve, with that of a fully interpretable classifier trained on the same dataset, which we discuss below. The left-hand side of Fig. 2 shows the precision-recall curves obtained in cross-validation, while the right-hand side shows the final curve on the test set.
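A sketch of this setup, using scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost (same ensemble shape: 1000 trees of maximum depth 2, 20% held out) on synthetic data; this is not our actual pipeline, only an illustration of the configuration described above:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))                  # six mock GC features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # synthetic host label

# hold out 20% for validation, stratified on the label
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# same ensemble shape as in the text: 1000 trees of maximum depth 2
clf = GradientBoostingClassifier(n_estimators=1000, max_depth=2, random_state=0)
clf.fit(X_tr, y_tr)

proba = clf.predict_proba(X_val)[:, 1]  # confidence of the host class
print(f"validation accuracy: {clf.score(X_val, y_val):.2f}")
```

The `proba` column is the soft score that gets thresholded to trace the precision-recall curve.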

Explaining XGBoost classifier: Anchors
Ribeiro et al. ( 2018) introduced local, model agnostic explanation in the form of anchor rules as an improvement on their previous algorithm, LIME (Ribeiro et al. 2016).Anchors are decision rules -sequences of if-then logical statementsthat locally approximate a black box: one such rule list is learned for each data point for which an explanation is required.The scope of applicability of such a rule is explicitly defined in terms of the neighborhood of the relevant data point, leading to the definition of anchors as scoped rules.Over the data points to which the rule applies, i.e. those that satisfy the logical conditions, the black box prediction coincides with the prediction cast for the data point being explained, at least with a given frequency, termed precision.The fraction of points covered by the rule is termed coverage.Ribeiro et al. (2018) empirically show that anchors are readily understandable to humans when used as post-hoc explanations to a black box model.Anchors however have a few drawbacks: first, being a post-hoc explanation they capture the behavior of the underlying black box only locally and up to the chosen precision; second, they can get quite complex or have very low coverage, especially in the proximity of the decision boundary, which challenges either their usefulness or their effectiveness as human-understandable explanations.
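The two quantities defined above, precision and coverage, can be illustrated with a toy computation (this is not the anchor search algorithm of Ribeiro et al. 2018, only the quantities it optimizes; the black box and the candidate rule below are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(1000, 2))  # two mock features

def black_box(X):
    """Stand-in black-box classifier."""
    return (X[:, 0] + X[:, 1] < 8).astype(int)

# instance to be explained, and the prediction the anchor must preserve
x = np.array([2.0, 3.0])
target = black_box(x[None, :])[0]

# candidate anchor: feature 0 < 4 and feature 1 < 4
mask = (X[:, 0] < 4) & (X[:, 1] < 4)
coverage = mask.mean()                              # fraction of points covered
precision = (black_box(X[mask]) == target).mean()   # agreement with the target

print(f"precision={precision:.2f}, coverage={coverage:.2f}")
```

A good anchor has high precision (the black box almost always agrees with the explained prediction where the rule holds) while retaining non-negligible coverage.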
For instance, an application for a loan could be rejected by a black-box decision algorithm; the relevant explanation could be a rule such as "you carry a balance on two credit cards or more, and your FICO score is below X"; this may not apply to other individuals, but is guaranteed to explain the majority of the relevant classification outcomes in a neighborhood of the instance for which we requested an explanation. In the vicinity of the decision boundary, finding such rules can become challenging. The intuition for this is that the decision is harder in such cases, making it harder to summarize the black-box behavior in simple terms.

Table 1. Rules found by CORELS during training in 5-fold cross-validation. The first column reports the number of folds in which a given rule was found; the corresponding rule is reported in the second column.

Number of folds | Rule
4 | Tr < 2000 Myr and Rh < 3.0 pc and L > 10^4 L⊙ and Rc/Rh > 0.2
1 | Tr < 5000 Myr and CSB > 10^3 L⊙/pc^2 and L > 10^4 L⊙ and Rc/Rh > 0.2

CORELS
Angelino et al. (2017) introduced the CORELS algorithm. CORELS produces concise rule lists based on binary features, up to a user-specified depth. These optimal rules are found by optimizing a loss function over the set of all possible rule lists satisfying the given constraints. Unlike previously used methods based on greedy search (e.g. Rivest 1987), CORELS has the advantage of finding the optimal rule list given the constraints. Finding optimal rule lists of this kind is in general a computationally hard problem. CORELS solves it by reducing the search space of rule lists, relying on recent mathematical results in discrete optimization, including those proved by the authors (Angelino et al. 2017) and in Nijssen & Fromont (2010). It also introduces specialized data structures that keep track of intermediate computations, and exploits symmetries to make the problem tractable. The CORELS loss function is made up of two terms: the first counts misclassifications, and the second penalizes the size (i.e. the number of rules) of the rule list (see Eq. 1 in Rudin 2019). CORELS is still slower than traditional greedy approaches for learning decision trees, but it has been shown to be more effective than them on a variety of real-life datasets (see, e.g., Rudin & Ertekin 2018).
The trade-off between rule accuracy and simplicity in CORELS can be adjusted by the user through a hyperparameter λ that sets the relative weight of the two terms of the loss. For instance, setting λ = 0.01 means that we are indifferent to misclassifying 1% of our dataset if this buys us a rule list containing one less rule.
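The effect of λ can be made concrete with a toy version of the regularized objective (the two candidate rule lists and their error rates below are hypothetical, chosen only to show how λ flips the preference):

```python
# toy regularized objective: misclassification rate + lambda * number of rules
def objective(misclassification_rate, n_rules, lam):
    return misclassification_rate + lam * n_rules

# hypothetical candidates: (misclassification rate, rule-list length)
candidates = {"four_rules": (0.020, 4), "three_rules": (0.025, 3)}

for lam in (0.001, 0.01):
    best = min(candidates, key=lambda k: objective(*candidates[k], lam))
    print(f"lambda={lam}: best list is {best}")
```

At λ = 0.001 the more accurate four-rule list wins; at λ = 0.01 the 0.5% accuracy gap is worth less than one rule, so the shorter list is preferred, exactly the indifference described above.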
Figure 2 shows the precision-recall curve of XGBoost on each of the five cross-validation folds. On each fold we also trained a CORELS model. The performance of both XGBoost and CORELS changes across folds, but in the case of CORELS it is immediately evident why. The only fold in which CORELS learned a rule different from the others (corresponding to the blue dot in Fig. 2) achieves a much greater recall at the cost of lower precision. We show below that the rule learned by CORELS in this case is less strict in weeding out GCs with a longer relaxation time.
It is also quite evident from Fig. 2 that the folds on which XGBoost performs best (green and black solid lines, respectively) also see higher performance from CORELS, which achieves its best (green dot) and second-best (black dot) precision on them.
The rules found by CORELS are reported in Tab. 1 and shown graphically (in projection on the relevant feature planes) in Fig. 3. Each was learned on one of the cross-validation folds. Remarkably, an identical rule is learned on four folds out of five, and the rule learned on the remaining fold is still quite similar to it. All rules stipulate that an IMBH candidate must have an inflated core, in particular requiring that Rc/Rh > 0.2. This is immediately interpretable dynamically: a large core is produced by dynamical heating due to a binary containing the IMBH. The IMBH has a high probability of ending up in a binary system in the first place, because its large mass ensures that exchange interactions are favoured in IMBH-binary encounters. In addition, mass segregation makes it likely that the secondary is a stellar-mass black hole. The resulting binary system hardens through interactions with the other objects in the core, releasing energy that ultimately results in a bigger core.
All rules also demand that the total luminosity of a potential candidate exceed 10^4 L⊙. This is a very low bar for GCs, whose typical luminosity is about one order of magnitude larger. This rule thus filters out GCs that are likely too small to assemble an IMBH, either because they do not reach a high enough density to trigger runaway mergers or because their escape velocity is too small to confine dark remnants that may eventually merge to form an IMBH.
High density is also a criterion in all rules, but it is enforced in one fold by a direct cutoff in central surface brightness, and in the other four by a condition on the half-mass radius, i.e. Rh < 3.0 pc. As shown in Fig. 3, this combines with the luminosity constraint common to all rules to select high-density GCs. Finally, all rules require that the relaxation time be short, but the cutoffs differ, being set either to 2 Gyr (in four folds out of five) or to 5 Gyr (in the remaining one). This is likely the reason for the higher recall in the corresponding fold, as shown in Fig. 2. A short initial relaxation time corresponds to a higher interaction frequency between stars, which results in faster mass segregation and makes both runaway mergers and stellar-mass black hole coalescences more likely. Relaxation times observed at 12 Gyr are essentially a proxy for initial relaxation times, which are unobservable (Trenti et al. 2010). Additionally, the swelling of the core due to an IMBH binary is observable only if the system is relaxed, because the associated dynamical heating requires time to take place. Thus a swollen core in the absence of a short relaxation time is an unreliable indicator of IMBH presence.
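The majority rule from Tab. 1 can be written directly as a predicate over the features, which is what makes the model natively interpretable (units as in the text: Myr, pc, L⊙; the values in the call are illustrative, not a real cluster):

```python
def corels_imbh_rule(Tr_myr, Rh_pc, L_lsun, Rc_pc):
    """Rule learned by CORELS on four of the five folds (Tab. 1):
    IMBH host if Tr < 2000 Myr, Rh < 3.0 pc, L > 1e4 Lsun, Rc/Rh > 0.2."""
    return (Tr_myr < 2000.0 and Rh_pc < 3.0
            and L_lsun > 1.0e4 and Rc_pc / Rh_pc > 0.2)

# short relaxation time, compact, luminous, inflated core -> predicted host
print(corels_imbh_rule(Tr_myr=1500.0, Rh_pc=2.5, L_lsun=5.0e4, Rc_pc=1.0))  # True
```

Unlike the XGBoost ensemble, this one-line predicate is the entire classifier.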

Rejecting out of distribution points
There is a potential mismatch between our simulated training data and actual GCs, which can lead to biased predictions. As we can see in Fig. 4, the data on which our models were tested and the real clusters are not perfectly matched. It is thus crucial to assess how far out-of-distribution the GCs we cast predictions for are with respect to the training data. A multitude of out-of-distribution detection approaches have been discussed in the ML literature (see Yang et al. 2021, for a recent review). In our case, we stuck to a simple density-based approach, which is generally recognized as effective in low-dimensional feature spaces. For this reason, we utilized KDE, a non-parametric technique for estimating the probability density function (PDF) of a random variable from independent, identically distributed samples. The basic idea behind KDE is to place a smooth, symmetric kernel function at each data point, and then sum these kernel functions to estimate the underlying PDF. While other approaches to density estimation exist, such as Gaussian mixture models (McLachlan & Basford 1988), mixture density networks (Bishop 1994), and autoregressive flows (Huang et al. 2018), KDE is a simple and time-tested method that works well in low dimension. We used a Gaussian-kernel KDE relying on the scikit-learn package (Pedregosa et al. 2011). We fitted the density estimation on the training set and evaluated it on both the test set and the real data set. The distributions of the KDE score are shown in Fig. 4.
A low KDE score corresponds to unusual (i.e. unlikely, out-of-distribution) data points with respect to the training set. We calculated the KDE scores for the points in our test set (randomly extracted from the simulation data set) and used their first decile as a cutoff to reject out-of-distribution data points among the actual GCs. This way we remove from the pool of IMBH host candidates those whose prediction would require us to trust our models in regions where the data on which they were trained are scarce.
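This rejection scheme can be sketched with scikit-learn's KernelDensity; the data below are mock two-dimensional features, with a handful of deliberately far-out points standing in for mismatched real clusters:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(500, 2))   # simulated training features
X_test  = rng.normal(0.0, 1.0, size=(100, 2))   # internal (in-distribution) test set
X_real  = np.vstack([rng.normal(0.0, 1.0, size=(40, 2)),
                     rng.normal(6.0, 1.0, size=(10, 2))])  # last 10 are far out

# fit the density on the training set
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X_train)

# cutoff: first decile of the KDE log-density on the test set
cutoff = np.percentile(kde.score_samples(X_test), 10)
ood_flag = kde.score_samples(X_real) < cutoff   # True = out of distribution

print(f"{ood_flag.sum()} of {len(X_real)} mock clusters flagged as OOD")
```

By construction, roughly 10% of genuinely in-distribution points land below the cutoff too, so the flag is conservative rather than a hard proof of mismatch.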

RESULTS
In this study, we compared a baseline black-box method (XGBoost) to a fully interpretable alternative (CORELS) for classifying GCs, quantifying the performance difference that accompanies the difference in interpretability. To this end, we applied both methods to the simulated GCs (described in Section 2) as well as to real GCs. Figure 2 shows the precision-recall curve of the XGBoost classifier on our test set. Additionally, we applied CORELS to the same data set, obtaining a precision (recall) of 0.89 (0.46). Since CORELS returns a hard classification, unlike XGBoost, we did not obtain a precision-recall curve for CORELS, but only single points in the precision-recall plane. The rules comprising the rule list learned by CORELS are visualized in Figure 4.
Furthermore, we used a KDE approach to measure how much the real data deviate from the test data distribution, flagging real GCs as out-of-distribution if they fell below the first decile of the KDE score. Lastly, we present a table of GCs that are predicted to be IMBH hosts both by CORELS and by XGBoost and are also in-distribution according to our KDE criterion. This is shown in Tab. 2, alongside the upper limits on IMBH mass obtained for each candidate by T18 (where available) and any reference we found presenting evidence for or against IMBH presence in the candidate. Candidate selection for this table is summarized by Figs. 5 and 6.

Physical interpretation of the CORELS rule list and of the anchors
To be classified as an IMBH host by CORELS, a GC has to meet the following conditions: Tr < 2 Gyr, Rh < 3.0 pc, L > 10^4 L⊙, and Rc/Rh > 0.2. The first rule selects dynamically relaxed GCs. The second and third rules, in combination, correspond to a condition on cluster density. The last rule selects GCs whose core is inflated with respect to the overall size of the cluster. All of these rules can be understood in terms of IMBH formation physics: mass segregation (which brings heavy stars and compact remnants to the cluster centre, where they can merge to give rise to an IMBH) happens on a timescale proportional to the two-body relaxation time. The typical age of a GC is of the order of ≈ 12 Gyr, so clusters with relaxation times of 2 Gyr or shorter had ample time to undergo mass segregation.
High density (captured by the requirement of a small radius at a minimum luminosity, the latter acting as a proxy for mass) is instrumental in increasing the frequency of close encounters that can result in IMBH formation. Finally, a dynamically puffed-up core was identified early on as a sign of IMBH presence (Baumgardt et al. 2005; Hurley 2007; Heggie et al. 2007): IMBHs tend to undergo dynamical exchange reactions with stellar-mass black hole (BH) binaries, forming an IMBH-BH binary that is confined to the core by its large mass and releases energy in the core through close encounters, heating it up. Additionally, close interactions with the IMBH can result in stars merging with it or being ejected from the central part of the cluster by strong encounters. Both effects contribute to depleting stars from the central region of the cluster.
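Read as a joint requirement, the four conditions can be sketched as a simple predicate. Note that an actual CORELS model is an ordered if-then rule list rather than a bare conjunction, and the unit of the luminosity threshold is our reading of the text, so this is a simplified illustration:

```python
def corels_imbh_host(t_r, r_h, lum, r_c):
    """Simplified conjunction of the four CORELS conditions.

    t_r : half-mass relaxation time [Gyr]
    r_h : half-light radius [pc]
    lum : total luminosity [solar units]
    r_c : core radius [pc]
    """
    return (
        t_r < 2.0            # dynamically relaxed: ample mass segregation
        and r_h < 3.0        # compact cluster ...
        and lum > 1e4        # ... above a minimum luminosity: high density
        and r_c / r_h > 0.2  # core inflated relative to cluster size
    )
```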
There is some overlap between the CORELS rules and the anchors listed in Table 3, suggesting that the latter may admit a similar physical interpretation. However, the anchors explaining the behavior of XGBoost often include a large number of features (at times all of them) and a disparate array of cutoffs, making a straightforward interpretation harder. They remain a potentially useful tool for investigating a specific candidate when planning a follow-up observation.
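In the anchors framework, each rule is scored by its precision (how often the model's prediction agrees with the explained prediction among points satisfying the rule) and its coverage (how often the rule applies), the two quantities reported in Table 3. A minimal empirical estimator with illustrative names (actual implementations, e.g. the alibi library, estimate precision over perturbed instances rather than a fixed sample):

```python
def anchor_stats(model, label, sample, anchor):
    """Empirical precision and coverage of an anchor rule.

    model  : hard classifier mapping a feature vector to a label
    label  : the prediction being explained
    sample : iterable of feature vectors (a fixed sample here; the
             anchors algorithm samples perturbations instead)
    anchor : predicate returning True where the rule applies
    """
    covered = [z for z in sample if anchor(z)]
    coverage = len(covered) / len(sample)
    if not covered:
        return 0.0, coverage
    precision = sum(1 for z in covered if model(z) == label) / len(covered)
    return precision, coverage
```

A complex anchor with many feature cutoffs typically trades coverage for precision, which is why the anchors in Table 3 are harder to read than the single global CORELS rule list.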

SUMMARY
In recent years, we have witnessed an explosion of ML tools in astronomy. Still, most contributions avoid, or only briefly touch upon, the issue of interpretability and explainability of the models they adopt. In this paper, we distinguish between the two approaches: interpretability means that a model is natively simple enough that a human can understand its inner workings, while explainability means that a complex or otherwise opaque model is explained post hoc in terms that a human can understand. This terminological distinction has not yet been widely adopted by the astronomical community, and the trickle of works that address the issue at some depth often conflates the two, using interpretability as an umbrella term. With this in mind, in the present work we have applied both kinds of techniques, comparing a natively interpretable model (CORELS) with a black-box-plus-explanation model (XGBoost explained using the local, model-agnostic explanation rules known as anchors) on the problem of IMBH detection in GCs. We trained and validated both models on features measured on GC simulations, and deployed the models in prediction on actual GCs. The features we selected are standard, easily measurable characteristics of GCs and are in fact readily available in multiple catalogues. The meaning of each feature is clearly understood in theoretical terms by astronomers.

Table 3. Anchors explaining the XGBoost decision for real GCs. The first column reports the GC name, the second column the anchor as a list of conditions (each feature has been given a different colour for ease of comparison), the third column reports the anchor's precision, and the fourth column its coverage.
We produced a list of GCs that are likely to contain an IMBH according to both models, presenting for each one the anchor rule, the confidence score from XGBoost, and a measure of how far out of distribution the actual GC is with respect to the training data. We compare the rule list learned by the interpretable CORELS model to the anchor rules and conclude that there is a qualitative resemblance, but the latter are usually more complex. Moreover, the rule list learned by CORELS has an immediate physical interpretation, some aspects of which are shared with a subset of the anchor rules.
We evaluated the performance of both CORELS and XGBoost on a test set of simulations not seen in training. We find that the CORELS model we learned, based on a rule list of four rules, achieves 90% precision at ≈ 50% recall, whereas the precision achieved by XGBoost at the same recall is nominally perfect (100%). However, this comes at the cost of relying on a black box, which is only moderately mitigated by post-hoc explanations.
Despite our best efforts to simulate GCs realistically, our training data and actual GCs are characterized by distributions that do not perfectly overlap in feature space. This raises the issue of whether we can trust the predictions of models trained on simulations and deployed on observational data. In this context, understanding the inner workings of a model provides a crucial advantage: we can decide whether to trust it based on how well its behavior matches our subject-matter knowledge, i.e. the physics of IMBH formation. Still, as an additional safeguard, we removed from our final candidate list the GCs that were too far out of distribution with respect to the training data. This was achieved by estimating the density of the training data in feature space with KDE and placing a cutoff on the resulting score. This procedure also implicitly yields a measure of the realism, taken as a whole, of the simulations we considered with respect to the actual observational data. The predictions cast by our models for actual GCs cannot be directly tested, since we do not know which GCs host an IMBH. However, they can be used as guidance for further investigation, with the knowledge that they are based on an interpretable model that is not working in extrapolation.
ACKNOWLEDGMENTS

ERC Consolidator grant DEMOBLACK, under contract no. 770017, and from the German Excellence Strategy via the Heidelberg Cluster of Excellence (EXC 2181-390900948) STRUCTURES. AA acknowledges support for this paper from project No. 2021/43/P/ST9/03167, co-funded by the Polish National Science Center (NCN) and the European Union Framework Programme for Research and Innovation Horizon 2020 under the Marie Skłodowska-Curie grant agreement No. 945339. For the purpose of Open Access, the authors have applied for a CC-BY public copyright licence to any Author Accepted Manuscript (AAM) version arising from this submission. AA also acknowledges support from the Swedish Research Council through the grant 2017-04217 and from NCN through the grant UMO-2021/41/B/ST9/01191.

Figure 1 .
Figure 1. Pair plot showing our GC simulations in feature space. IMBH hosts are shown as orange points and non-hosts as blue points. In order, the axes are: half-mass relaxation time (HRT), total luminosity (TVL), central velocity dispersion (CVD), central surface brightness (CSB), core radius (OCR), and half-light radius (OHLR). All the reported values are at 12 Gyr.

Figure 2 .
Figure 2. Left-hand panel: precision-recall curve for our XGBoost classifier on each cross-validation fold (solid lines). The dots represent the recall and precision achieved by a CORELS classifier trained on the same fold. The colours identify the fold for both XGBoost and CORELS. Right-hand panel: precision-recall curve for our XGBoost classifier on the test set. The red dot represents the recall and precision achieved by the CORELS classifier on the same test set. In both panels, CORELS yields a single point rather than a curve because it produces a fixed rule list leading to a hard class label.

Figure 3 .
Figure 3. The rules found by CORELS, each displayed on the plane of the relevant pair of features. The white area corresponds to the region that satisfies the rule. The red points correspond to simulated GCs in the training set that host an IMBH, and the blue points correspond to non-hosts.

Figure 4 .
Figure 4. Upper panel: pair plot showing test data (a representative sample of the simulations, shown in blue) versus real GC data (orange). In order, the axes are: half-mass relaxation time (HRT), total luminosity (TVL), central velocity dispersion (CVD), central surface brightness (CSB), core radius (OCR), and half-light radius (OHLR). Lower panel: distribution of the KDE score for test data (orange) and actual GC data (blue). The vertical line is the cutoff adopted for inclusion in the final sample of candidates, which corresponds to the first decile of the KDE score for test data.

Figure 5 .
Figure 5. Candidate selection shown in the space of relaxation time, total luminosity, log half-light radius, and log core over half-light radius. The gray squares represent the training data set; the black dots, the data from actual GCs compiled by Baumgardt & Hilker (2018). GCs selected as IMBH host candidates only by XGBoost (CORELS) are shown in blue (red), and candidates selected by both models are shown in purple.

Figure 6 .
Figure 6. Euler-Venn diagram summarizing the inclusion rules for GCs in the final candidate list. The dark gray set includes all GCs that fall in distribution with respect to the simulations according to the KDE-based criterion discussed in the text (112 in total); the light gray set includes the XGBoost candidates (88 in total); and the brown set, the CORELS candidates (29 in total). The clusters in the mutual intersection of all three sets are included in the final candidate list (24 in total).

Table 2 .
Final table of IMBH candidate hosts. The first column reports the cluster name; columns 3-8 report the (rounded, log) values of the relevant features. Columns 9-11 show the classification predictions by XGBoost, CORELS, and the KDE-based in-distribution criterion, and column 12 shows the confidence of the classification by XGBoost, by which the rows are sorted in decreasing order. Column 13 reports the IMBH upper mass limit based on radio observations by T18. The last column lists any previous claims or other relevant literature discussing IMBHs in the cluster. The full table, including non-hosts, is presented in Appendix A.