Identifying Galaxy Mergers in Simulated CEERS NIRCam Images Using Random Forests

Identifying merging galaxies is an important—but difficult—step in galaxy evolution studies. We present random forest (RF) classifications of galaxy mergers from simulated JWST images based on various standard morphological parameters. We describe (a) constructing the simulated images from IllustrisTNG and the Santa Cruz SAM and modifying them to mimic future CEERS observations and nearly noiseless observations, (b) measuring morphological parameters from these images, and (c) constructing and training the RFs using the merger history information for the simulated galaxies available from IllustrisTNG. The RFs correctly classify ∼60% of non-merging and merging galaxies across 0.5 < z < 4.0. Rest-frame asymmetry parameters appear more important for lower-redshift merger classifications, while rest-frame bulge and clump parameters appear more important for higher-redshift classifications. Adjusting the classification probability threshold does not improve the performance of the forests. Finally, the shape and slope of the resulting merger fraction and merger rate derived from the RF classifications match with theoretical Illustris predictions but are underestimated by a factor of ∼0.5.


INTRODUCTION
Mergers are known to play an important role in the evolution of galaxies over cosmic time. The gravitational interactions between merging galaxies compress gas and create shocks, inducing star formation throughout, and can funnel gas toward their centers, powering nuclear starbursts and fueling active galactic nuclei (AGN) (e.g., Sanders et al. 1988; Mihos & Hernquist 1996; Hopkins et al. 2008). This process can also disrupt the orderly rotation of disk stars, driving the morphological transition of galaxies by turning spiral disk galaxies into ellipticals (e.g., Toomre 1977; Cox et al. 2006; Kormendy et al. 2009; Rodriguez-Gomez et al. 2017), as well as inducing disk instabilities that may cause the buildup of the most massive structures at z > 3 (e.g., Tacchella et al. 2015; Zolotov et al. 2015; Costantin et al. 2021, 2022). We now know that the rate at which mergers occur evolves strongly out to z ∼ 1.5, as seen by many observational studies as well as cosmological simulations (e.g., Kartaltepe et al. 2007; Bridge et al. 2010; Lotz et al. 2011; Rodriguez-Gomez et al. 2015; Mantha et al. 2018), and that interactions and mergers can have a large impact on the star formation rates and AGN activity of galaxies (e.g., Ellison et al. 2008; Patton et al. 2011; Larson et al. 2016).
Studies of the merger rate during cosmic noon (z ∼ 1 − 3) have benefited from deep NIR surveys of galaxies with the Wide Field Camera 3 (WFC3) on the Hubble Space Telescope (HST), though many uncertainties in the observations and tension with expectations from simulations remain. At these higher redshifts, the empirical merger rates of Lotz et al. (2011) and O'Leary et al. (2021) and the theoretical rates of Hopkins et al. (2010) and Rodriguez-Gomez et al. (2015) continue to increase (albeit at different rates). On the other hand, Man et al. (2016) find that their empirical merger rate flattens, while Mantha et al. (2018) find that their empirical rate either turns over and begins decreasing, or keeps increasing, depending on the criteria used for their merger selection. Duncan et al. (2019) compare their observed merger rates from the Cosmic Assembly Near-infrared Deep Extragalactic Legacy Survey (CANDELS; Koekemoer et al. 2011; Grogin et al. 2011) to previous studies. They find that for galaxies with log10(M*/M⊙) > 10.3, their increasing merger rate agrees with that measured in the Illustris simulation by Rodriguez-Gomez et al. (2015) out to z ∼ 6, as well as with Mundy et al. (2017) out to z ∼ 2. However, beyond z ∼ 2, the rate of Duncan et al. (2019) disagrees with that of Mantha et al. (2018), which begins decreasing. For galaxies with 9.7 < log10(M*/M⊙) < 10.3, the increasing rate of Duncan et al. (2019) agrees with that of Ventou et al. (2017) out to z ∼ 3 (within their error bars), yet disagrees with the shallower rate from Illustris, indicating that the Illustris merger rates are in tension with observations. Furthermore, Duncan et al. (2019) find that their comoving merger rates suffer from significant uncertainties at z > 4. Additionally, Ferreira et al. (2020) identify mergers using deep learning and find that the rates broadly agree with visual classifications out to z ∼ 3.
There are several sources of uncertainty that currently plague our understanding of mergers at these high redshifts. First, modern surveys with WFC3 are only sensitive to the most massive galaxies and major mergers (mass ratio <4:1; e.g., Ellison et al. 2013; Mantha et al. 2018), and the numbers detected drop off sharply at z > 3. At these redshifts, it is expected that low-mass galaxies and minor mergers (mass ratios between 4:1 and 10:1) may play an increasing role (e.g., Kaviraj 2014). Furthermore, the conversion of an observed merger fraction into a merger rate requires the assumption of a merger timescale and of how it might evolve with redshift. This timescale itself is highly uncertain and relies on information from simulations (Snyder et al. 2017; Duncan et al. 2019). Finally, identifying mergers from images in the first place poses many of its own challenges.
Typically, there have been two methods employed to identify merging systems: 1) identifying close pairs of galaxies that are likely to merge at some point in the future and 2) identifying advanced mergers through morphological disturbances, such as double nuclei, tidal tails, and other asymmetries. The close pair method finds galaxy pairs that are close on the sky and in redshift, using either spectroscopic redshifts (e.g., Lin et al. 2004; Tasca et al. 2014; Ventou et al. 2017; Shah et al. 2020) or a sophisticated analysis of photometric redshifts (e.g., Kartaltepe et al. 2007; Mantha et al. 2018; Duncan et al. 2019). Identifying galaxy mergers through morphological features can be done via visual classifications (e.g., Lintott et al. 2008; Kartaltepe et al. 2015), and is typically robust since the human eye is skilled at picking out patterns and faint features in noisy images. However, visual classifications can be both subjective and time-consuming, especially for large surveys.
Quantitative parameters such as CAS, G, M20, and the MID statistics (e.g., Conselice 2003; Lotz et al. 2004; Freeman et al. 2013) were developed as a less subjective alternative to visual classifications.
In populations of low-redshift galaxies, these quantitative morphology parameters have been shown to effectively locate mergers, since they can be correlated with features like asymmetries, multiple cores, starbursts, and tidal tails that are caused by merging (e.g., Conselice 2003; Lotz et al. 2004, 2008b; Freeman et al. 2013; Wen et al. 2014; Snyder et al. 2015; Pawlik et al. 2016). However, for high-redshift galaxies, these parameters become less effective at classifying mergers (e.g., Lotz et al. 2004; Conselice et al. 2008; Kartaltepe et al. 2010; Thompson et al. 2015; Snyder et al. 2015). The main reason for this is simply that at high redshifts, mergers are more difficult to see: cosmological surface brightness dimming leads to the loss of low surface brightness features, poor angular resolution leads to the blurring of small-scale structure, and rest-frame optical emission is shifted to the near- and mid-infrared, which means that optical and shorter-wavelength instruments such as ACS and WFC3 on HST are unable to observe the structure of the general stellar population. Furthermore, high-redshift galaxies can have intrinsically more clumpy and irregular morphologies than those at low redshifts, which could masquerade as merger signatures (e.g., Dekel et al. 2009; Kartaltepe et al. 2012, 2015).
Therefore, high-quality, deep, high-resolution near-infrared imaging is needed to detect merger features at high redshifts and subsequently curate an accurate sample of high-redshift galaxy mergers. The state-of-the-art and recently launched James Webb Space Telescope (JWST) will provide the highest quality images of distant galaxies to date. One of the first deep extragalactic surveys that JWST will undertake is the Cosmic Evolution Early Release Science (CEERS) Survey (PI: S. Finkelstein). CEERS utilizes JWST imaging and spectroscopy in parallel over 100 arcmin² in the Extended Groth Strip (EGS) HST legacy field for 0.5 < z < 13 galaxies. CEERS uses JWST's near-infrared camera (NIRCam) to probe structural features at higher redshifts than has been possible with HST (Finkelstein et al. 2017). Figure 1 compares the same simulated z = 3 galaxy as seen with HST/ACS + WFC3 and with JWST/NIRCam using the CEERS observing strategy, at similar wavelengths. The improved resolution of the JWST CEERS images will result in much more accurate morphological measurements than the HST images.
In addition to using deeper, high-resolution NIR imaging from JWST, new analysis techniques can be employed to better identify the morphological signatures of mergers. In particular, machine learning methods show great promise for identifying high-redshift mergers. At low redshift, a number of studies have already used machine learning for morphological classifications (e.g., Huertas-Company et al. 2015; Peth et al. 2016; Sreejith et al. 2018; Bottrell et al. 2019; Nevin et al. 2019; Pearson et al. 2019; Cheng et al. 2020; Guzman-Ortega et al., in preparation). For merger classifications in particular, Nevin et al. (2019) note that their machine learning method outperforms the classical automated identification methods (e.g., Conselice 2003; Lotz et al. 2004) in the nearby universe. Recently, studies have begun to explore using machine learning to identify high-redshift mergers (Snyder et al. 2019; Ćiprijanović et al. 2020; Ferreira et al. 2020, 2022; Sharma et al. 2021). In particular, Snyder et al. (2019) use random forests to identify 0.5 < z < 4 mergers from Illustris-1 and find that their classifier achieves a true positive rate of up to ∼70% for mergers in simulated HST images. They also note that their classifier generally outperforms the simpler classifiers à la Conselice (2003) and Lotz et al. (2004) across their redshift range.
In this paper, we explore the use of random forests in identifying galaxy mergers from simulated CEERS images from IllustrisTNG in preparation for new JWST observations. In §2, we describe the simulated JWST images used in this work as well as how those images were modified to create one set that matches CEERS NIRCam observations and other effectively noiseless sets for comparison purposes. In §3, we describe the large set of quantitative morphology parameters used as input to the forests and how they were calculated from the data. In §4 and §5, we present our random forest analyses and discuss the results. We summarize and conclude in §6.
CEERS is expected to reveal objects with irregular and perturbed morphologies in great detail, which will allow astronomers to track the structural evolution of galaxies from z ∼ 7 to today. Since CEERS is targeting the EGS HST field, CEERS data will also be accompanied by a rich set of multi-wavelength data from HST ACS (Davis et al. 2007), HST WFC3 (Koekemoer et al. 2011; Grogin et al. 2011; Momcheva et al. 2016), Spitzer (Ashby et al. 2013), Herschel (Lutz et al. 2011; Oliver et al. 2012), Chandra (Laird et al. 2009; Nandra et al. 2015), and a number of ground-based imaging and spectroscopic surveys (e.g., Coil et al. 2004; Cooper et al. 2012; Newman et al. 2013; Kriek et al. 2015; Masters et al. 2019) with high-quality photometric redshifts and stellar masses (Stefanon et al. 2017). JWST launched in December 2021, with the first CEERS images observed in June 2022 and released in July 2022, including 4 out of the 10 NIRCam pointings that make up the complete CEERS NIRCam mosaic in the EGS field. The remaining 6 pointings will be observed in December 2022.

The Simulated Images
We construct mock images of the EGS field from IllustrisTNG100-1, one of three cosmological simulations of galaxy formation and evolution included in the IllustrisTNG suite (Springel et al. 2018; Naiman et al. 2018; Nelson et al. 2018; Pillepich et al. 2018a; Marinacci et al. 2018). Compared to Illustris-1, IllustrisTNG uses several updated prescriptions for physical processes (such as magnetic fields, black hole feedback, and galactic winds) as detailed by Weinberger et al. (2017) and Pillepich et al. (2018b). TNG100-1 has been shown to produce galaxy morphologies in good agreement with observations (e.g., Rodriguez-Gomez et al. 2019; Tacchella et al. 2019).
To construct simulated images, we first adopt a simulated lightcone with a footprint overlapping the observed EGS field, constructed using dark matter halos from an N-body simulation and the Santa Cruz semi-analytic model (SAM) for galaxy formation (Somerville et al. 2021; Yung et al. 2022). The Santa Cruz SAM tracks a wide variety of baryonic processes using prescriptions derived analytically, inferred from observations, or extracted from numerical simulations, and provides physically backed predictions for galaxies across wide ranges of redshift and mass. This model has been shown to be able to reproduce the observed evolution in distribution functions of rest-frame UV luminosity, stellar mass, and SFR from z ∼ 0 to the highest redshift where observational constraints are available (Somerville et al. 2015; Yung et al. 2019a,b).
Then, for each galaxy in the SAM, we identify IllustrisTNG subhalos from the TNG100-1 simulation by searching in the space of halo mass versus star formation rate. We randomly choose matching subhalos from those within a factor of two in these dimensions from the SAM galaxy catalogue. We then create simulated images of each subhalo using the public visualization API (Nelson et al. 2019a) in each of the JWST NIRCam F115W, F150W, F200W, F277W, F356W, and F444W filters, and add them together to form the wide-area mock images. These images cover an area of ∼100 square arcmin, to mimic the size of the EGS field that CEERS will cover, and contain over 100,000 galaxies. The pixel scale of the images is 0.03 arcsec/pixel. These images are accompanied by catalogs with intrinsic information such as redshift, star formation rate, and stellar mass. Additionally, the merger history catalogs for IllustrisTNG galaxies (Rodriguez-Gomez et al. 2015; Nelson et al. 2019b) used in this work give the IllustrisTNG snapshot numbers for each galaxy's most recent past merger and next future merger (both major and minor), as well as the total number of past mergers experienced in the galaxy's history for different timescales. The merger history information available for these galaxies makes IllustrisTNG an ideal dataset for training and testing machine learning algorithms to identify galaxy mergers at different stages.
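The matching step can be sketched as follows. This is only an illustration of choosing a random subhalo within a factor of two in halo mass and star formation rate; the dictionary keys (`halo_mass`, `sfr`) and the function name are hypothetical and are not taken from the SAM or IllustrisTNG catalogs.

```python
import random

def pick_matching_subhalo(sam_galaxy, subhalos, factor=2.0, rng=None):
    """Pick a random subhalo within `factor` of the SAM galaxy in mass and SFR.

    `sam_galaxy` and each entry of `subhalos` are dicts with the hypothetical
    keys "halo_mass" and "sfr" (any consistent units). Returns None when no
    subhalo falls inside the matching box.
    """
    rng = rng or random.Random()
    lo, hi = 1.0 / factor, factor
    candidates = [
        s for s in subhalos
        if lo * sam_galaxy["halo_mass"] <= s["halo_mass"] <= hi * sam_galaxy["halo_mass"]
        and lo * sam_galaxy["sfr"] <= s["sfr"] <= hi * sam_galaxy["sfr"]
    ]
    return rng.choice(candidates) if candidates else None
```

Writing the bounds as products rather than ratios avoids a division by zero for quiescent galaxies with zero star formation rate.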

Modifying the Simulated Images
We start with the pristine simulated images and convolve them with the PSF of each filter, using the model PSFs from WebbPSF (Perrin et al. 2014). We then add Poisson noise (due to the galaxies in the image) following the formulation of Pontoppidan et al. (2016), where each image is convolved with a kernel to sum fluxes in neighboring pixels. We then add background noise to represent the actual CEERS exposure times (see Table 1), which was estimated using the exposure time calculator (ETC) system developed for JWST (Pontoppidan et al. 2016). Last, the Python package photutils is used to estimate the average background in each image, which is then subtracted from the images, resulting in a final set of CEERS-like, background-subtracted images for each of the six NIRCam filters. Figure 2 shows examples of simulated mergers at each step of the image modification process. We similarly create a set of effectively noiseless images, using an extremely long exposure time of 11 days in the ETC (see Table 1), so that we can compare measurements from our CEERS-like images to those from the noiseless ones. This step is necessary because a purely noiseless set of images causes the Galapagos-2 code (see §3) to output unusually large errors, and therefore true noiseless images cannot be used for our analysis. Examples of nearly noiseless galaxies are shown in column E of Figure 2. Finally, the headers of each image are amended to include the keywords required by Galapagos-2.
Table 1. JWST ETC detector setup and resulting exposure times for each set of images. The CEERS setup here is described in the original CEERS proposal (Finkelstein et al. 2017). This setup has since changed to MEDIUM8 with 9 groups times 3 exposures for the actual CEERS observations. However, the final exposure times are similar and should not affect our results.
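The order of operations in the image modification can be illustrated with a simplified sketch. It is not the pipeline used here: it assumes the image is a count rate (counts per second), draws source and background Poisson noise directly instead of following the Pontoppidan et al. (2016) kernel formulation, and uses a plain median as a stand-in for the photutils background estimate. All function and argument names are hypothetical.

```python
import numpy as np

def convolve_same(image, kernel):
    """Direct 2-D convolution with zero padding; output has the image's shape."""
    kh, kw = kernel.shape  # assumed odd in both dimensions
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros(image.shape, dtype=float)
    for i in range(kh):
        for j in range(kw):
            # Flipped kernel indices make this a convolution, not a correlation.
            out += kernel[kh - 1 - i, kw - 1 - j] * padded[
                i:i + image.shape[0], j:j + image.shape[1]
            ]
    return out

def make_noisy_image(pristine_rate, psf, exptime_s, bkg_rate, rng):
    """Blur, add source and background Poisson noise, subtract the background."""
    blurred = convolve_same(pristine_rate, psf / psf.sum())
    source_counts = rng.poisson(np.clip(blurred, 0.0, None) * exptime_s)
    bkg_counts = rng.poisson(bkg_rate * exptime_s, size=blurred.shape)
    noisy_rate = (source_counts + bkg_counts) / exptime_s
    # Median used here as a simple stand-in for a proper background estimate.
    return noisy_rate - np.median(noisy_rate)
```

In this simplified picture, a much longer `exptime_s` lowers the relative noise in the count rate, which is the idea behind the nearly noiseless image set.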
All of our final CEERS-like and noiseless simulated NIRCam images are available to the public via https://ceers.github.io/releases.html.

QUANTITATIVE MORPHOLOGY PARAMETERS
This work makes use of several different morphology parameters. Parametric measurements assume that the galaxy's light distribution follows a specific mathematical profile, such as the Sérsic profile, specified by the Sérsic index n (Sérsic 1963) and other parameters. Nonparametric measurements do not assume an underlying mathematical form, but rather are statistical measures of the light distribution in a galaxy (e.g., the CAS parameters; Bershady et al. 2000; Conselice et al. 2000; Conselice 2003). Sérsic parameters were calculated using the IDL program Galapagos-2 from the MegaMorph Project. MegaMorph is a project designed to improve astronomers' ability to measure the structure of galaxies via parametric methods while making full use of modern multiwavelength imaging surveys (Bamford et al. 2011; Häußler et al. 2013; Vika et al. 2013). Using multiwavelength information allows one to constrain fit parameters that vary smoothly as a function of wavelength, which produces more physically consistent models. Under the MegaMorph Project, Galapagos (Barden et al. 2012) was modified to create Galapagos-2, and Galfit (Peng et al. 2002; Peng et al. 2010) was modified to create GalfitM.
GalfitM is a least-squares fitting algorithm that finds the optimum solution to the Sérsic fit for a galaxy (Peng et al. 2002; Peng et al. 2010; Peng 2012). Galapagos (Galaxy Analysis over Large Areas: Parameter Assessment by Galfitting Objects from SExtractor) is essentially an IDL wrapper routine that allows GalfitM to be used for large survey images. In addition to the final output catalog with the Sérsic fit parameters, Galapagos-2 also outputs the original stamp, the GalfitM model, and the residual image for each galaxy detected in the survey images. Galapagos-2/GalfitM can also do bulge-disk decomposition and output separate Sérsic parameters for both galaxy components.
Galapagos-2 uses the Source Extractor program for object detection (Bertin & Arnouts 1996). Galapagos-2 first runs Source Extractor in "cold" mode to deblend nearby galaxies, then in "hot" mode to detect faint galaxies. The "cold" and "hot" catalogues are then combined. Table 2 lists our "cold" and "hot" mode parameters for both sets of images.
The Python package statmorph was used for calculating nonparametric morphology measurements as well as single Sérsic fits (Rodriguez-Gomez et al. 2019). The inputs to statmorph are the science image, the segmentation map created by Source Extractor, the PSF, and the gain. While statmorph can handle either single galaxy images or survey images, it cannot handle multiple images at once to make full use of multiwavelength data. The morphology measurements from statmorph include: concentration (C), asymmetry (A), and clumpiness/smoothness (S) (Bershady et al. 2000; Conselice et al. 2000; Conselice 2003); the Gini coefficient (G) and the moment of light (M20) (Abraham et al. 2003; Lotz et al. 2004); multimode (M), intensity (I), and deviation (D) (Freeman et al. 2013); and outer asymmetry (A_O) and shape asymmetry (A_S) (Wen et al. 2014; Pawlik et al. 2016), as well as various parameters such as radii (e.g., r20, r50, r80, r_half, r_petro) and signal-to-noise per pixel.
We also make use of the residual images output by GalfitM. Residual images have been shown to have the potential to highlight asymmetric or unusual structures not captured by the Sérsic model (e.g., Mantha et al. 2019). We run statmorph on all residual images to obtain residual morphology measurements. Since statmorph requires that images have positive flux, which was not true for all residual images, we add an offset of +1 to every pixel of all residual images. This addition does affect statmorph's measurements, but will be consistent across all residual images, both mock CEERS and nearly noiseless.
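As an illustration of what a nonparametric statistic measures, the Gini coefficient can be written in a few lines using the form given by Lotz et al. (2004), G = Σ (2i − n − 1)|X_i| / (|X̄| n (n − 1)), with the pixel values sorted in increasing order. This is a minimal sketch, not the statmorph implementation, and it assumes the caller has already selected the pixels belonging to the galaxy.

```python
def gini(pixel_values):
    """Gini coefficient of a list of pixel fluxes (Lotz et al. 2004 form).

    0 means the flux is spread evenly over the pixels; 1 means all of the
    flux sits in a single pixel.
    """
    x = sorted(abs(v) for v in pixel_values)
    n = len(x)
    if n < 2:
        raise ValueError("need at least two pixels")
    mean = sum(x) / n
    if mean == 0:
        raise ValueError("all pixel values are zero")
    weighted = sum((2 * i - n - 1) * xi for i, xi in enumerate(x, start=1))
    return weighted / (mean * n * (n - 1))
```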

MERGER IDENTIFICATION
As described in §1, machine learning techniques are potentially more advantageous for high-redshift merger identification compared to classical techniques, due to their ability to exploit complex multidimensional information. Here, we choose to explore the random forest technique (Ho 1995; Breiman 2001), which we describe in §4.4. Prior to implementing random forests, we first prepare the morphology catalog that will be the features input to the forests (§4.1) and define the merger and non-merger classes that will be the labels input to the forests (§4.2). We also show an example of the performance of a classical method using our mock CEERS dataset, which further illustrates the need for machine learning merger identification (§4.3).

Catalog Creation
The IllustrisTNG merger history catalog (Rodriguez-Gomez et al. 2015; Nelson et al. 2019b) contains IllustrisTNG snapshot numbers for each galaxy, as well as snapshot numbers for the most recent and next merger events, for both major and minor mergers. We convert the snapshot numbers to ages of the universe in Gyr, then calculate the difference in time between the galaxy of interest and the last and next merger events. The resulting merger history catalog therefore contains the time since the last merger (both major and minor) and the time until the next (both major and minor).
After running Galapagos-2 and statmorph, we combine these catalogs with the merger history catalog using match_coordinates_sky() from the astropy Python package. The final master catalog contains 109,395 galaxies, although only ∼70,000 are detected by Source Extractor in the simulated CEERS images and subsequently have morphology measurements. The final catalog therefore contains columns for intrinsic information such as redshift and merger timescales, as well as the Galapagos-2 and statmorph measurements.
For our analysis, we restrict our mock CEERS dataset to have statmorph signal-to-noise per pixel (S/N_F115W) > 3, as well as Flag_Galfit = 2.0, which indicates successful Galfit measurements. We also restrict our dataset to only galaxies with 0.5 ≤ z ≤ 4.0 to match the redshift range chosen in Snyder et al. (2019). Figures 3 and 4 illustrate some basic properties of galaxies in the full original IllustrisTNG dataset compared to the galaxies detected in our mock CEERS-like images. Figure 3 shows that the full redshift range extends from z = 0.5 up to 10, with only a few galaxies existing in the highest redshift bins. It also shows that the stellar masses in our z < 4 mock CEERS set span a range from log(M*/M⊙) ∼ 5 − 12, but there are far fewer low-mass objects in the mock CEERS set than in the original IllustrisTNG set. All of the lowest mass galaxies (both original IllustrisTNG and mock CEERS) are in the lower redshift bins. Figure 4 shows star formation rate as a function of stellar mass, divided into the redshift bins used in this work. The IllustrisTNG data (and mock CEERS data) follow the main sequence fits of Whitaker et al. (2014) and Schreiber et al. (2015), with a notable lack of starbursts lying above the main sequence.
For comparison with the effectively noiseless images, we make two slightly different datasets. The first dataset (EN Set 1) uses statmorph S/N_F115W > 3 and Flag_Galfit = 2.0 cuts based on the effectively noiseless data, which captures fainter objects not seen in the mock CEERS dataset. The second dataset (EN Set 2) uses statmorph S/N_F115W > 3 and Flag_Galfit = 2.0 cuts based on the mock CEERS data; however, the effectively noiseless morphology measurements are still used as inputs to the forests. This was done to directly compare the same objects in both the CEERS-like and effectively noiseless images. Table 3 lists the sizes of each dataset used in this work. Note that there are fewer objects in EN Set 2 than in the mock CEERS set.
Since the CEERS-like images and the nearly noiseless images are different images with different noise properties, Source Extractor will not make the exact same detections in both, and may over-deblend or under-deblend objects in one set of images but not the other. Additionally, Galapagos-2 and statmorph may flag an object with "bad" measurements in one set but not the other. Therefore, we are unable to perfectly match galaxies in the nearly noiseless catalog to those in the mock CEERS catalog, and lose some galaxies due to the aforementioned issues.
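The conversion to cosmic time can be sketched as below. The sketch assumes that each snapshot number has already been mapped to a redshift using the simulation's snapshot table, and it assumes a flat ΛCDM cosmology with Planck 2015-like parameters (h = 0.6774, Ω_m = 0.3089, Ω_Λ = 0.6911), neglecting radiation; these values and all function names are illustrative assumptions, not quantities quoted in this work.

```python
import math

# Assumed flat LambdaCDM parameters (Planck 2015-like); illustrative only.
H0_KM_S_MPC = 67.74
OMEGA_M = 0.3089
OMEGA_L = 0.6911

KM_PER_MPC = 3.0856775814913673e19
SEC_PER_GYR = 3.15576e16

def age_gyr(z):
    """Age of the universe at redshift z in Gyr (flat LambdaCDM, no radiation)."""
    hubble_time_gyr = KM_PER_MPC / H0_KM_S_MPC / SEC_PER_GYR
    x = math.sqrt(OMEGA_L / OMEGA_M) * (1.0 + z) ** -1.5
    return hubble_time_gyr * 2.0 / (3.0 * math.sqrt(OMEGA_L)) * math.asinh(x)

def time_since_merger_gyr(z_galaxy, z_last_merger):
    """Time elapsed between a past merger and the epoch of the galaxy."""
    return age_gyr(z_galaxy) - age_gyr(z_last_merger)

def time_until_merger_gyr(z_galaxy, z_next_merger):
    """Time between the epoch of the galaxy and its next merger."""
    return age_gyr(z_next_merger) - age_gyr(z_galaxy)
```

A past merger lies at higher redshift than the galaxy and a future merger at lower redshift, so both differences come out positive.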

Merger Definition
We create labels for the random forest algorithm using the time since and time until a major and minor merger. Following Snyder et al. (2019), we combine major and minor mergers to increase our training set size. For a binary classification scheme, galaxies that have never experienced a merger or never will experience a merger (denoted as −1.0 in the final catalog) are labeled as Class 0 ("non-merger"). Also following Snyder et al. (2019), we choose a timescale window of 500 Myr (±250 Myr) for our merger class definition, since the lifetime of merger features will likely not be longer. Therefore, galaxies whose most recent merger occurred more than 250 Myr ago and whose next merger is more than 250 Myr in the future are also assigned to Class 0. Galaxies that experience a merger within 250 Myr, past or future, are assigned to Class 1 ("merger"). As a check, we shift the merger definition to include windows from 100 Myr to 500 Myr in the past or future (to include more pre- and post-mergers), but do not see a significant improvement in the performance of the forests. We also test using a three-class classification scheme with "non-merger," "past merger," and "future merger," but the forests perform poorly in these trials, most likely due to low numbers in each merger class.
Figure 5 shows the four merger timescales, past major and future major (top panel) and past minor and future minor (bottom panel), for each galaxy in our mock CEERS set compared to the full IllustrisTNG dataset. The red shaded region shows that the selected ±250 Myr window spans a relatively narrow range of the full timescale distributions.
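The binary labeling rule can be expressed compactly. The sketch below assumes the four timescales are stored in Gyr, that −1.0 marks "no such merger" as in the catalog, and that the window boundary is inclusive; the function name and argument layout are hypothetical.

```python
NO_MERGER = -1.0  # catalog sentinel for "no past/future merger"

def merger_label(times_gyr, window_gyr=0.25):
    """Return 1 ("merger") or 0 ("non-merger") for one galaxy.

    `times_gyr` holds the time since the last major and minor mergers and
    the time until the next major and minor mergers, in Gyr. Major and
    minor mergers are treated alike, so any event inside the window counts.
    """
    return int(any(t != NO_MERGER and 0.0 <= t <= window_gyr for t in times_gyr))
```

Changing `window_gyr` reproduces the check described above, in which the window is varied between 100 Myr and 500 Myr.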

Performance of Classical Methods
We compare the merger classification performance of machine learning techniques with the performance of classical methods, such as the G − M20 parameter space (Lotz et al. 2004, 2008b), in order to judge if machine learning provides any improvements. Figure 6 shows the F277W observed (rest-frame optical) G − M20 parameter space for objects in the mock CEERS 3.0 < z < 3.5 redshift bin. The merger discriminating line is as defined by Lotz et al. (2008b). True mergers, according to the merger definition in §4.2, occupy the same space as non-mergers. In this redshift bin, the fraction of correctly classified mergers, according to Equation 1, is only ∼19% since most true mergers lie below the merger discriminating line. This is to be expected since this method is very sensitive to merger stage, and is best at selecting mergers just after first passage (Lotz et al. 2008a). Of the predicted mergers (objects above the merger discriminating line), only ∼48% are actually true mergers. If we choose the F356W filter for our observed filter, the results are the same. Across all redshift bins, the fraction of correctly classified mergers ranges from ∼19% to 23%. The fraction of predicted mergers that are actually true mergers ranges from only ∼4% at the lowest redshift bin to ∼50% at the highest redshift bin, a consequence of the increasing number of mergers at higher redshifts.
These results illustrate the poor performance of classical methods for identifying mergers at a range of stages at high redshift, and motivate the use of machine learning techniques for improving the level of completeness.
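A classical G − M20 selection of this kind can be sketched as a single linear cut followed by a tally of completeness and purity. The default coefficients below are the commonly quoted form of the Lotz et al. (2008b) merger line, G > −0.14 M20 + 0.33; they are exposed as parameters and should be checked against that paper before use. Function names are hypothetical.

```python
def is_gm20_merger(gini, m20, slope=-0.14, intercept=0.33):
    """True if the point lies above the merger line G = slope * M20 + intercept."""
    return gini > slope * m20 + intercept

def completeness_and_purity(gini_vals, m20_vals, true_labels):
    """Fraction of true mergers recovered, and fraction of selected objects that are mergers."""
    predicted = [is_gm20_merger(g, m) for g, m in zip(gini_vals, m20_vals)]
    tp = sum(1 for p, y in zip(predicted, true_labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, true_labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, true_labels) if not p and y == 1)
    completeness = tp / (tp + fn) if tp + fn else float("nan")
    purity = tp / (tp + fp) if tp + fp else float("nan")
    return completeness, purity
```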

Random Forest Experiments
The random forest (RF) algorithm is a supervised classification algorithm consisting of many decision trees (Ho 1995; Breiman 2001). A single decision tree is a flowchart-like diagram where each split (or "node") in the tree represents a decision made based on the input features. The "terminal nodes" at the end of the tree represent the possible classifications. The algorithm uses an ensemble of decision trees to minimize overfitting from any one tree. We choose to use random forests due to their simplicity and because Snyder et al. (2019) demonstrated that random forests show some promise at high-z galaxy merger classification tasks.
Table 3 shows our dataset sizes after cleaning the dataset and defining the merger class. We split the data into training and test sets, with a training fraction of 0.67, where the ratio of objects is preserved for each class. We use BalancedRandomForestClassifier() from the imblearn Python package, which is specifically designed for imbalanced data sets. It works by deliberately undersampling the majority class during training. The morphology parameters we feed to the forests are the single Sérsic index n, the two-component Sérsic indices n_bulge and n_disk, and the nonparametric A, C, G, I, M20, M, A_O, D, and S, in all six filters. We also feed to the forests the nonparametric A, C, G, I, M20, M, A_O, D, and S as calculated from the residual images, in all filters.
We run keras-tuner for hyperparameter optimization, which uses Bayesian optimization to search the parameter space and find the optimal combination of hyperparameters without having to test all possible combinations (O'Malley et al. 2019). Generally, keras-tuner will find optimal hyperparameters in less time than other hyperparameter tuning algorithms such as GridSearchCV. The hyperparameters that we let vary are "max_samples" (0.9 - 0.99 with a step size of 0.1), "max_features" (1 - 15 with a step size of 1), "max_leaf_nodes" (5 - 55 with a step size of 1), and "n_estimators" (1000 - 2000 with a step size of 50). We allow keras-tuner to search over the parameter space for 50 trials. For each trial, we provide keras-tuner with a cross-validation (CV) set (CV fraction was 0.2 of the training set) that preserves the ratio of objects for each class. We train seven separate forests for our seven different redshift bins.
We explore many different sets of hyperparameters and allowed ranges, and find that generally the forests perform similarly regardless of fine-tuning the hyperparameters. Therefore, although further tuning is possible, we conclude that it is not necessary; the forests most likely will not perform significantly better than reported here.
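The two ideas in this paragraph, a class-preserving (stratified) split and undersampling of the majority class, can be illustrated without any machine learning library. This is a conceptual sketch only: it is not the imblearn implementation, which, as we understand it, draws a balanced sample separately for each tree, and all names are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=0.67, seed=0):
    """Split indices into train/test sets while preserving the class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        cut = round(train_fraction * len(idxs))
        train += idxs[:cut]
        test += idxs[cut:]
    return sorted(train), sorted(test)

def undersample_majority(indices, labels, seed=0):
    """Randomly drop objects until every class has as many as the rarest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    n_min = min(len(idxs) for idxs in by_class.values())
    balanced = []
    for idxs in by_class.values():
        balanced += rng.sample(idxs, n_min)
    return sorted(balanced)
```

For a toy set with 90 non-mergers and 10 mergers, the split keeps roughly the same 9:1 ratio in both subsets, and undersampling the training subset leaves equal numbers of each class.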
We categorize the output of the random forest into four classes:
• True Positives (TP): the number of true mergers correctly classified by the random forest.
• False Positives (FP): the number of non-mergers incorrectly classified as mergers.
• True Negatives (TN): the number of correctly classified non-mergers.
• False Negatives (FN): the number of true mergers incorrectly classified as non-mergers.
Therefore the number of RF-selected mergers is TP + FP, and the number of RF-selected non-mergers is TN + FN. The number of true mergers is TP + FN, and the number of true non-mergers is TN + FP. We can judge the performance of the random forests using several metrics:
• True Positive Rate (TPR), also known as recall or completeness, is defined as
$$\mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} \quad (1)$$
• False Positive Rate (FPR), also known as fall-out, is defined as
$$\mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}} \quad (2)$$
• Positive Predictive Value (PPV), also known as precision, is defined as
$$\mathrm{PPV} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} \quad (3)$$
• F1 Score is the harmonic mean of precision (P) and recall (R), and is defined as
$$F_1 = \frac{2 P R}{P + R} \quad (4)$$
Classifiers that perform well have both high precision and high recall (and therefore a high F1 score). Accuracy, defined as (TP + TN)/N where N is the total number of objects, is a biased indicator of performance for imbalanced sets, so we do not consider it here.
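These definitions translate directly into code. The following minimal sketch computes the four metrics from the confusion-matrix counts; the function name is hypothetical, and undefined ratios are returned as NaN rather than raising an error.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return TPR (recall), FPR (fall-out), PPV (precision), and F1 score."""
    def ratio(num, den):
        return num / den if den else float("nan")

    tpr = ratio(tp, tp + fn)
    fpr = ratio(fp, fp + tn)
    ppv = ratio(tp, tp + fp)
    f1 = ratio(2.0 * ppv * tpr, ppv + tpr)
    return {"tpr": tpr, "fpr": fpr, "ppv": ppv, "f1": f1}
```

For a purely hypothetical imbalanced set with TP = 63, FN = 37, TN = 600, and FP = 400, the recall is 0.63 while the precision is only 63/463 ≈ 0.14, which shows how a rare positive class can combine moderate recall with low precision.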
Figure 7 shows a confusion matrix for 3 < z ≤ 3.5 galaxies. This is an example from one of the best random forest trials. It shows that the forest correctly classified 60% of the non-merger class and 63% of the merger class. The confusion matrices for the other redshift bins all look similar to this one, with 58% - 63% of the non-merger class and 60% - 64% of the merger class correctly classified (see the recall values in Figure 9).
The top panel of Figure 8 shows the corresponding receiver operating characteristic (ROC) curve for this redshift bin, which illustrates the performance of the random forest for different discrimination thresholds. A discrimination threshold is the probability cutoff (default = 0.5) used to assign the final classification. The curve of a perfect classifier would consist of two straight lines from (0,0) to (0,1) and from (0,1) to (1,1). The curve of a random classifier consists of a straight line from (0,0) to (1,1). The performance of a classifier can then be judged by how close its curve comes to the upper left corner of the plot. This figure shows that our random forest does better than a random classifier.
The bottom panel of Figure 8 shows the corresponding precision-recall curve for this redshift bin, which in principle is more informative for imbalanced datasets than the ROC curve. For the precision-recall curve, a perfect classifier reaches the upper right-hand corner of the plot at (1,1). Unlike in the ROC plot, the curve of a random classifier is not fixed; it is a horizontal line at the ratio of positives (mergers) to the total number of objects. Therefore a random classifier for a perfectly balanced dataset would lie at y = 0.5. This plot, like the ROC curve, shows that our forest performs better than random chance. However, these plots, especially the precision-recall curve, also show that our classifiers are far from perfect. The ROC curves and precision-recall curves for the other redshift bins look similar to those shown here.
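To make the construction of a ROC curve concrete, the sketch below sweeps the discrimination threshold over a small set of merger probabilities and records the (FPR, TPR) pair at each step. The labels and probabilities are invented for illustration, and the helper names are assumptions rather than part of our pipeline.

```python
def roc_points(y_true, probs):
    """Return (FPR, TPR) pairs for thresholds swept from high to low.

    Objects with probability >= threshold are classified as mergers (1).
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every probability
    for t in sorted(set(probs), reverse=True):
        tp = sum(1 for y, p in zip(y_true, probs) if p >= t and y == 1)
        fp = sum(1 for y, p in zip(y_true, probs) if p >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical labels (1 = merger) and forest-assigned merger probabilities.
y_true = [1, 1, 0, 1, 0, 0]
probs = [0.70, 0.60, 0.55, 0.45, 0.40, 0.30]
curve = roc_points(y_true, probs)
auc = area_under_curve(curve)
```

A random classifier traces the diagonal and has an area of 0.5, while a perfect classifier has an area of 1; the toy example here lies in between.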
Figure 9 shows how precision, recall, and F1 scores change across redshift for both mergers and non-mergers. For mergers, the classification metrics improve as redshift increases. For non-mergers, the classification metrics generally worsen as redshift increases. This is likely due to the test set becoming more balanced at higher redshifts. To test this, we artificially balanced the z = 0.5 − 1.0 test set and found that the merger precision score (and therefore F1 score) dramatically increased and the non-merger precision (and therefore F1 score) decreased such that they were more in line with the results of the higher-redshift bins. This means that the performance of the z = 0.5 − 1.0 forest can be improved with respect to the merger class by simply randomly removing non-mergers from the test set. This implies that the better performance of the forests in the higher-redshift bins is mostly due to the decreasing excess of non-mergers, not because the forests were better trained.
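The dependence of precision on class balance can be seen with a short calculation. For fixed TPR and FPR, precision is TPR·f / (TPR·f + FPR·(1 − f)), where f is the fraction of true mergers in the sample. In the sketch below the TPR and FPR are roughly representative of the recall values quoted above, while the 10% merger fraction is a hypothetical value chosen only to illustrate an imbalanced set.

```python
def precision_from_rates(tpr, fpr, merger_fraction):
    """Merger-class precision implied by fixed TPR/FPR and a class balance."""
    true_pos = tpr * merger_fraction            # mergers recovered
    false_pos = fpr * (1.0 - merger_fraction)   # non-mergers misclassified
    return true_pos / (true_pos + false_pos)


# TPR ~ 0.62 and FPR ~ 0.40 (i.e. ~60% of non-mergers correctly classified).
tpr, fpr = 0.62, 0.40
precision_imbalanced = precision_from_rates(tpr, fpr, merger_fraction=0.10)  # hypothetical
precision_balanced = precision_from_rates(tpr, fpr, merger_fraction=0.50)
```

With identical classifier behavior, the merger precision rises from roughly 0.15 to roughly 0.61 when the sample is balanced, which is the sense in which the improvement reflects the sample composition rather than better training.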
Figure 9. Precision (squares), recall (triangles), and F1 score (crosses) for mergers (green) and non-mergers (dashed orange) as a function of redshift. Non-merger class metrics tend to worsen with redshift while merger class metrics tend to improve with redshift. The exception is recall, which is consistent around ∼ 0.60 across redshift.

For the effectively noiseless images, the forests trained and tested on EN Set 1 generally performed worse than the mock CEERS forests. The train and test sets were larger, but this increase was mostly due to the inclusion of faint and probably ambiguous-looking galaxies that were cut from the mock CEERS set. We conclude that the difficulty of classifying these objects probably outweighed any gains from the ability to see fainter structures in the effectively noiseless images. The left panel of Figure 10 shows the confusion matrix for this set.
The forests trained and tested on EN Set 2 generally performed slightly worse than the mock CEERS forests. In this case, any gains from the effectively noiseless images are probably outweighed by the smaller size of the training and test sets. The right panel of Figure 10 shows the confusion matrix for this set.

Feature Importance
Each forest calculates the importance of the features given to it. The more important a feature is, the more useful it is to the forest for distinguishing mergers from non-mergers. Figure 11 shows the top five most important features for each redshift bin. This figure shows that asymmetry features (e.g., A and A_O) are most important for low redshift bins while bulge and clump features (e.g., G and S) are more important for higher redshift bins. These most important features are calculated from the science images, not the residual images. There also appears to be a dependence on filter. The bluer F115W filter is more useful for low redshift bins, and the redder F444W filter is more useful for higher redshift bins. This seems to indicate that the forests are using the rest-frame optical features to make decisions, even though all filters were available to each forest.
Figure 12 shows examples of 3.0 < z < 3.5 galaxies categorized into true positives (correctly classified mergers), false positives (incorrectly classified non-mergers), true negatives (correctly classified non-mergers), and false negatives (incorrectly classified mergers). The top left-hand corner of each stamp shows the probability of the object being a merger as assigned by the random forest. The stamps are arranged in order of decreasing probability, so the horizontal orange line effectively represents the probability threshold between merger and non-merger classifications, which is the default 0.5. The top right-hand corner of each stamp shows the F356W Gini statistic for each galaxy, which was the most important feature for the random forest. The bottom left-hand corner of each stamp shows the merger timescale for each galaxy. The timescales include both major and minor mergers. A positive timescale indicates the time since a past merger, while a negative timescale indicates the time until a future merger. Galaxies can have both past and future mergers, so whichever timescale is smallest in absolute value is shown here. Recall that the merger cutoff defined in §4.2 is ±250 Myr, so true positives and false negatives all have an absolute merger timescale < 0.25 Gyr, while false positives and true negatives all have an absolute merger timescale > 0.25 Gyr. Finally, the segmentation map outlines are color-coded by merger type (major or minor).
The distribution of merger probabilities ranges from about 0.3 to 0.7, while the distribution of the Gini statistic ranges from about 0.4 to 0.65. There is a slight trend where probability increases as Gini increases, which becomes increasingly clear when examining the full test set. This is expected since the F356W Gini statistic was the most important feature to the random forest for this redshift bin, but the correlation is not so strong that Gini alone can be used to determine merger status. There appears to be little to no trend of merger timescale with either Gini or probability when looking at the full test set.
Other interesting insights come from looking at the false negatives and false positives. Many of the false positives in Figure 12 have segmentation maps that appear elongated, either because the segmentation map is contaminated with emission from background or neighboring galaxies, or because the galaxy has persisting remnants of signatures from a merger outside of the chosen time frame, which may have contributed to the "merger" designation by the forest. The average past and future timescales of true negatives are 0.75 ± 0.35 Gyr and 0.66 ± 0.26 Gyr, respectively. The average past and future timescales of false positives are 0.66 ± 0.32 Gyr and 0.71 ± 0.43 Gyr, respectively. Since the timescale distribution of false positives generally matches that of the true negatives, within error, it does not appear that false positives are closer in time to a merger than other non-mergers. On the other hand, many of the false negatives appear relatively undisturbed visually, especially the ones with smaller merger probabilities, suggesting that even though these are true mergers, those mergers have had a relatively minor impact on the morphology. The fraction of minor mergers among the true positives and false negatives is 35.1% and 35.6%, respectively. This indicates that false negatives are no more likely to be minor mergers than true positives.
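The timescale convention and the ±250 Myr cutoff can be summarized in a few lines. The sketch below assumes each galaxy carries up to four signed timescales in Gyr (past and future, major and minor), with `None` marking a merger that does not exist in the catalog; the function names and example values are hypothetical.

```python
def nearest_merger_timescale(timescales_gyr):
    """Return the signed timescale closest to zero, or None if there is none.

    Positive values are the time since a past merger; negative values are
    the time until a future merger.
    """
    valid = [t for t in timescales_gyr if t is not None]
    if not valid:
        return None
    return min(valid, key=abs)


def is_true_merger(timescales_gyr, window_gyr=0.25):
    """Label a galaxy a merger if its nearest merger lies within the window."""
    nearest = nearest_merger_timescale(timescales_gyr)
    return nearest is not None and abs(nearest) < window_gyr


# Hypothetical galaxies: (past major, future major, past minor, future minor).
galaxy_a = (0.80, -0.10, None, -1.20)   # future major merger in 0.1 Gyr
galaxy_b = (0.66, -0.71, None, None)    # nearest merger is 0.66 Gyr in the past
shown_a = nearest_merger_timescale(galaxy_a)
label_a = is_true_merger(galaxy_a)
label_b = is_true_merger(galaxy_b)
```

Under this convention the value printed on each stamp corresponds to `nearest_merger_timescale`, and its absolute value determines the intrinsic merger label.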

Thresholding
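The probability threshold used to distinguish between mergers and non-mergers can be adjusted in an attempt to improve the performance of the random forest. There are a number of methods for selecting the optimal threshold based on the trade-off between the TPR and FPR, or precision and recall, including:
• Geometric Mean (G-mean), defined as
$$G = \sqrt{\mathrm{TPR} \times (1 - \mathrm{FPR})}$$
• Youden's J statistic (J), defined as
$$J = \mathrm{TPR} - \mathrm{FPR}$$
• Matthews Correlation Coefficient (MCC), defined as
$$\mathrm{MCC} = \frac{\mathrm{TP} \times \mathrm{TN} - \mathrm{FP} \times \mathrm{FN}}{\sqrt{(\mathrm{TP} + \mathrm{FP})(\mathrm{TP} + \mathrm{FN})(\mathrm{TN} + \mathrm{FP})(\mathrm{TN} + \mathrm{FN})}}$$
• Balance Point, defined by the point at which TPR = 1 − FPR.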
• F1 Score -defined in §4.4 -which is the harmonic mean between precision and recall.
For the 3.0 < z < 3.5 redshift bin, G-mean, J, and MCC all return the same optimal threshold of 0.518. The balance point returns a threshold of 0.504. All four are very close to the default threshold of 0.5. Figure 8 highlights the location of these two thresholds on the ROC curve, which are located where the curve of the test set is closest to the upper left-hand corner of the plot. The optimal threshold based on the F1 score was 0.470, which is highlighted in the precision-recall curve in Figure 8. Here, the threshold is located where the curve of the test set is closest to the upper right-hand corner of the plot.
Use of the G-mean/J/MCC threshold (Figure 13, top) or the balance point threshold (Figure 13, middle) improves the performance of the forest on the non-merger class (fraction correctly classified = 0.66 or 0.61, respectively, where before it was 0.60), at the cost of poorer performance on the merger class. Use of the F1 score threshold (Figure 13, bottom) drastically improves the performance on the merger class (fraction correctly classified = 0.79 where before it was 0.63), but at great cost to the non-merger class (fraction correctly classified = 0.44 where before it was 0.60). This shows that one could optimize the performance of the forest to generate a more complete sample of mergers, but that sample would be highly contaminated. None of the thresholds improve performance on both the merger and non-merger classes.
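The threshold selection can be sketched as a scan over candidate thresholds, keeping the one that maximizes each criterion. The definitions used below are the standard ones (G-mean = sqrt(TPR × (1 − FPR)), J = TPR − FPR, the usual MCC, and the balance point where TPR = 1 − FPR) and are assumed to match those used here; the labels, probabilities, and candidate grid are invented for illustration.

```python
import math


def confusion_counts(y_true, probs, threshold):
    """Count TP, FP, TN, FN when prob >= threshold is classified as a merger."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, probs):
        if p >= threshold:
            if y == 1:
                tp += 1
            else:
                fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def g_mean(tp, fp, tn, fn):
    return math.sqrt(tp / (tp + fn) * tn / (tn + fp))  # sqrt(TPR * (1 - FPR))


def youden_j(tp, fp, tn, fn):
    return tp / (tp + fn) - fp / (fp + tn)  # TPR - FPR


def mcc(tp, fp, tn, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def balance(tp, fp, tn, fn):
    # Negative distance from TPR = 1 - FPR, so that larger is better.
    return -abs(tp / (tp + fn) - tn / (tn + fp))


def f1_score(tp, fp, tn, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, probs, score, candidates):
    """Return the candidate threshold that maximizes the given score."""
    return max(candidates,
               key=lambda t: score(*confusion_counts(y_true, probs, t)))


# Hypothetical labels (1 = merger) and merger probabilities.
y_true = [1, 1, 0, 1, 0, 0]
probs = [0.70, 0.60, 0.55, 0.45, 0.40, 0.30]
candidates = [0.3, 0.4, 0.5, 0.6, 0.7]
best = {name: best_threshold(y_true, probs, fn_, candidates)
        for name, fn_ in [("G-mean", g_mean), ("J", youden_j), ("MCC", mcc),
                          ("balance", balance), ("F1", f1_score)]}
```

Different criteria can select different thresholds for the same scores, and each choice trades performance on one class against the other rather than improving both.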

Merger Fraction and Merger Rate
Finally, we calculate the merger fraction and merger rate using both mergers selected by the random forest and true mergers based on our merger timescale window of 0.5 Gyr. First, we calculate the fraction of merging galaxies selected by the random forest as $f_{\mathrm{uncorr}}(\mathrm{RF}) = N_{\mathrm{RF}}/N$, that is, the total number of galaxies (in the test set) selected as mergers by the random forest divided by the total number of galaxies (in the test set) for a given redshift bin. Then we multiply this fraction by PPV/TPR and <M/N> (Snyder et al. 2019). PPV/TPR corrects for the known incompleteness and purity of the classifier based on the training set. <M/N> is the average number of merging events per true merger, and accounts for the fact that some true mergers experience more than one merger during the specified time frame of 0.5 Gyr. Therefore, the actual merger fraction for the RF-selected mergers is:
$$f_{\mathrm{merger}}(\mathrm{RF}) = f_{\mathrm{uncorr}}(\mathrm{RF}) \times \frac{\mathrm{PPV}}{\mathrm{TPR}} \times \langle M/N \rangle$$
Here, <M/N> was calculated using only the true positives in the test set.
The merger fraction for the true merging galaxies (from the test set) based on our merger timescale definition is then
$$f_{\mathrm{merger}}(\mathrm{true}) = \frac{\mathrm{TP} + \mathrm{FN}}{N} \times \langle M/N \rangle$$
since there is no need to correct for the performance of the random forest classifier. However, our intrinsic merger definition also does not account for multiple mergers within our time frame, so the <M/N> factor is still required. Here, <M/N> was calculated using the true positives plus the false negatives in the test set.
The left panel of Figure 14 shows our uncorrected random forest merger fraction $f_{\mathrm{RF}}$, the corrected random forest merger fraction $f_{\mathrm{merger}}(\mathrm{RF})$, and the true merger fraction $f_{\mathrm{merger}}(\mathrm{true})$. This plot reveals that the uncorrected fraction $f_{\mathrm{RF}}$ is overestimated compared to the true merger fraction, but once corrected by PPV/TPR, the RF-selected merger fractions and the true merger fractions line up very well. This panel also shows the theoretical Illustris merger fraction (derived from Rodriguez-Gomez et al. 2015) and the random forest merger fraction estimated from Snyder et al. (2019). Our uncorrected merger fractions line up very closely with the uncorrected merger fractions from Snyder et al. (2019). The shape and steep slope of our corrected fraction $f_{\mathrm{merger}}(\mathrm{RF})$ and true fraction $f_{\mathrm{merger}}(\mathrm{true})$ generally match the theoretical Illustris fraction. However, our $f_{\mathrm{merger}}(\mathrm{RF})$ and $f_{\mathrm{merger}}(\mathrm{true})$ are underestimated compared to theory and the corrected random forest fractions from Snyder et al. (2019).
To calculate merger rates, we divide the merger fractions by our merger window timescale of 0.5 Gyr. The right panel of Figure 14 shows our merger rates for the corrected RF-selected mergers and for true mergers. We again show the theoretical Illustris merger rates derived from Rodriguez-Gomez et al. (2015) as shown in Snyder et al. (2019). This panel shows that our merger rates are roughly 0.5 times the theoretical merger rate (grey dashed line), i.e., underestimated by about a factor of two.
Since both our random forest fractions and rates and our true fractions and rates are underestimated when compared to theory, the issue may lie in our calculation of <M/N>. From our merger history catalog, we calculate the timescales for the most recent and the first future major and minor mergers. However, this does not tell us if a galaxy has experienced, e.g., more than one past major merger or more than one future minor merger within the specified merger window. Therefore our values for <M/N> are almost certainly underestimated, which may account for the discrepancy between our data and the theoretical Illustris curves.
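The correction and the conversion to a rate amount to a short calculation: the uncorrected fraction N_RF/N is multiplied by PPV/TPR and <M/N>, and the result is divided by the 0.5 Gyr window. The sketch below uses invented counts and metric values purely for illustration; the function and argument names are assumptions.

```python
def corrected_merger_fraction(n_rf, n_total, ppv, tpr, mergers_per_merger):
    """Corrected merger fraction: (N_RF / N) * (PPV / TPR) * <M/N>."""
    f_uncorr = n_rf / n_total
    return f_uncorr * (ppv / tpr) * mergers_per_merger


def merger_rate(merger_fraction, window_gyr=0.5):
    """Merger rate per galaxy per Gyr for a given observability window."""
    return merger_fraction / window_gyr


# Hypothetical redshift bin: 450 of 1000 test galaxies flagged as mergers,
# with PPV = 0.30, TPR = 0.60, and 1.1 merger events per true merger.
f_rf = corrected_merger_fraction(n_rf=450, n_total=1000, ppv=0.30,
                                 tpr=0.60, mergers_per_merger=1.1)
rate_rf = merger_rate(f_rf)
```

The correction works because N_RF × PPV is the number of true positives and dividing by TPR recovers the number of true mergers, TP + FN, provided the PPV and TPR measured on labeled data also hold for the sample being corrected.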

Comparison to Previous Works
We compare our results to those of Snyder et al. (2019) and Sharma et al. (2021), both of which use random forests to classify high redshift merging galaxies. Snyder et al. (2019) classify mergers in a sample of 0.5 < z < 4.0 Illustris-1 simulated HST galaxies, with a merger definition of ±250 Myr, and report a true positive rate of ∼ 70% across their redshift range. Sharma et al. (2021) classify mergers in a sample of 0.5 < z < 3 SPHGal simulated HST images, with a merger definition of ±500 Myr, and report a true positive rate of 0.95 for the full data set (see their Figure 3).
There are a few possible explanations for the better performance of these studies. Both Snyder et al. (2019) and Sharma et al. (2021) focused on merger classification performance, whereas here we try to optimize performance on both mergers and non-mergers. Sharma et al. (2021) specifically note that their merger fraction is overestimated due to false positives. Snyder et al. (2019) trained random forests on single snapshots, whereas here we use redshift ranges, so the forests from Snyder et al. (2019) may be overtrained on a per-snapshot basis. Finally, Snyder et al. (2019) and Sharma et al. (2021) use Illustris-1 and SPHGal, respectively, to create their simulated images rather than IllustrisTNG as we do here, and there may be differences between the three that make it easier or harder to select mergers with this technique.

SUMMARY AND CONCLUSIONS
In this work, we investigate using random forests to classify merging galaxies in simulated CEERS-like and nearly noiseless images, which were constructed from IllustrisTNG and the Santa Cruz SAM. We use the morphology programs Galapagos-2 and statmorph to calculate a number of morphology parameters, which were then used as inputs to the random forests. We also use IllustrisTNG merger history catalogs to define intrinsic merger labels, which were also given to the forests. We train seven random forests for seven different redshift bins, and find the following results:
1. The forests correctly classify ∼60% of mock CEERS mergers and non-mergers across all redshift bins. The precision of the merger class increases with redshift while the precision of the non-merger class decreases with redshift. ROC curves and precision-recall curves indicate that the forests perform better than random classifiers. The random forests do not perform better when trained and tested on morphology parameters from nearly noiseless simulated images.
2. Rest-frame asymmetry features tend to be most important for merger classifications at low redshift, while rest-frame bulge and clump features tend to be more important at higher redshifts.
3. False positives tend to appear elongated with potential faint merger signatures, despite being no closer in time to the ±250 Myr merger window than true negatives. False negatives tend to appear undisturbed by their recent mergers.
4. Selecting different probability thresholds improves performance on one class only at the cost of worse performance on the other class.
5. After correcting for the incompleteness and purity of the forests, we recover the true merger fraction of the mock CEERS dataset very well (using the ±250 Myr merger definition). The shape and slope of our mock CEERS corrected merger fraction and merger rate, $f_{\mathrm{merger}}(\mathrm{RF})$ and $R_{\mathrm{merger}}(\mathrm{RF})$, match theoretical Illustris predictions. However, our $f_{\mathrm{merger}}(\mathrm{RF})$ and $R_{\mathrm{merger}}(\mathrm{RF})$ are underestimated compared to Illustris predictions.
Given a sample of CEERS galaxies with unknown merger labels, the results of this work indicate that we could recover a reasonable merger fraction and merger rate. However, it would be difficult to disentangle specific true mergers from misclassified non-mergers.
One area of improvement lies with the segmentation map. The morphology parameters described in this work all depend on how galaxies are identified and deblended in the segmentation map, so great care must be taken in how sources are detected and deblended. The impact of source detection on merger identification is an important topic for future exploration.
Our findings suggest that we have reached the ceiling on how well random forests are able to identify mergers from these standard morphological parameters. Further improvement is likely to be gained through training convolutional neural networks (CNNs) to identify mergers directly from the images, which will be the subject of a future paper.

Figure 2. Examples of galaxy mergers at different redshifts from the original F277W image (column A) before undergoing the following modifications: convolution with the PSF (column B); inclusion of Poisson noise and CEERS background noise (column C); and then background subtraction (column D). The nearly noiseless versions are shown in column E. The magenta contours in columns D and E are the segmentation map outlines from Source Extractor.

Figure 3. Stellar mass versus redshift for objects from the full original IllustrisTNG sample (grey), and from the mock CEERS sample with z < 4, the focus of this paper (blue). Above and to the right are the distributions of redshift and stellar mass, respectively.

Figure 4. Star formation rate versus stellar mass for objects from the full original IllustrisTNG sample (grey), and from the z < 4 mock CEERS sample (blue), split by redshift bin. The main sequence fits from Whitaker et al. (2014, purple) and Schreiber et al. (2015, orange) are shown for comparison.

Figure 5. Merger history histograms for objects from the full original IllustrisTNG sample (grey), and from the z < 4 mock CEERS sample (blue). The red shading indicates the 500 Myr window used to distinguish between mergers and non-mergers for our random forest experiments. Each galaxy in our dataset has four merger timescales: past and future major (top) and past and future minor (bottom).

Figure 6. F277W G − M20 space for mock CEERS galaxies in the 3.0 < z < 3.5 redshift bin. True non-mergers are grey. True mergers are colored by the rainbow gradient, which indicates the merger timescale used for merger classifications.

Figure 7. Confusion matrix for 3 < z ≤ 3.5 objects from the test set. The diagonal shows the fraction of objects correctly classified for each class.

Figure 8. Top: ROC curve for the 3 < z ≤ 3.5 random forest. The black diamond shows the optimal threshold selected by G-Means, J statistic, and MCC and the black cross shows the optimal threshold selected by the balance point (see §5.2). Bottom: Precision-recall curve for the 3 < z ≤ 3.5 random forest. The black circle shows the optimal threshold selected by the F1 score (see §5.2). In both plots, the training set (green) and test set (purple) curves lie in the region of "good" classifiers, between the perfect (black) and no-skill (red) classifiers.

Figure 10. Confusion matrices for 3.0 < z ≤ 3.5 objects from the EN Set 1 (left) and from the EN Set 2 (right).

Figure 11. The top five most important features (1 = most important, 5 = fifth most important) in each redshift bin, as a function of filter. Bluer colors correspond to bluer filters, and redder colors correspond to redder filters. The abbreviations are: A_O: outer asymmetry, A: asymmetry, C: concentration, D: deviation, G: Gini statistic, m20: moment of light, M: multimode, n: Sérsic index, S: clumpiness. See §3 and references therein. Asymmetry features in the bluer filters are more important for low redshift bins while bulge/clump features in the redder filters are more important for higher redshift bins.

Figure 12. Examples of true positive (TP), false positive (FP), true negative (TN), and false negative (FN) galaxies in the F356W filter from the 3.0 < z < 3.5 redshift bin. Each stamp is 3 x 3 arcsec. In each stamp: the merger probability output by the forest is in the upper left, the F356W Gini statistic is in the upper right, and the timescale since or until the nearest merger (major or minor) is in the bottom left. The outlines show the segmentation map, color-coded by major (magenta) and minor (green) mergers.

Table 2. Source Extractor parameters for each set of images.