Identifying 46 New Open Cluster Candidates in Gaia EDR3 Using a pyUPMASK and Random Forest Hybrid Method

Open clusters (OCs) are regarded as tracers for understanding stellar evolution theory and validating stellar models. In this study, we present a robust approach to identifying OCs. A hybrid method combining pyUPMASK and random forest (RF) is first used to remove field stars and determine more reliable members. An identification model based on the RF algorithm, trained on 3714 OC samples from Gaia DR2 and EDR3, is then applied to identify OC candidates. The final OC candidates are obtained after isochrone fitting, advanced stellar population synthesis (ASPS) model fitting, and visual inspection. Using the proposed approach, we revisited 868 candidates that had been preliminarily clustered by the friends-of-friends algorithm in Gaia EDR3. Excluding the open clusters that have already been reported, we focused on the remaining 300 unknown candidates. From high to low fitting quality, these candidates were further classified into Class A (59), Class B (21), and Class C (220). As a result, 46 new reliable open cluster candidates among Classes A and B are identified after visual inspection.


INTRODUCTION
Open clusters (OCs) are gravitationally bound stellar systems whose member stars formed simultaneously from the same molecular cloud in a single starburst. Therefore, OCs are natural laboratories and valuable tracers for studying the structure, chemical composition, and dynamical evolution of galaxies, and they provide validation of and constraints on astrophysical evolutionary models (Spina et al. 2022). For example, young OCs are often used to analyze the structure of galaxies. They also serve as testbeds for studying stellar evolution, allowing us to investigate the boundary conditions necessary for new star formation (Cantat-Gaudin et al. 2018a). Besides providing information about the height and radial extent of the Galactic disk, old OCs also provide information about the chemical history of the Galaxy, e.g., the relationship between age and metallicity, mixing processes, and cluster destruction processes caused by interactions with other clusters.
However, due to the limitations of Galactic dust extinction and contamination of field stars (foreground and background stars), identifying OCs is still a challenging issue (Deb et al. 2022). Many efforts have been made to hunt OC candidates from the Gaia Second Data Release (Gaia DR2, Gaia Collaboration et al. (2018)) and Gaia (Early) Data Release 3 (Gaia (E) DR3, Gaia Collaboration et al. (2021)).
Various methods based on unsupervised machine learning clustering algorithms have been used to search for OCs. One of the most successful is the Density-Based Spatial Clustering of Applications with Noise algorithm (DBSCAN) (Ester 1996). A series of DBSCAN variants has proven capable of effectively identifying OCs (Castro-Ginard et al. 2018; Castro-Ginard et al. 2019; Castro-Ginard et al. 2020; He et al. 2021; Castro-Ginard et al. 2022). In addition, Kounkel & Covey (2019), Kounkel et al. (2020), and Hunt & Reffert (2021) used Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN), an improved method based on DBSCAN, to detect many new clusters. Cantat-Gaudin et al. (2018b) applied an unsupervised membership assignment code (UPMASK) to the Gaia DR2 data contained within the fields of those clusters. Gao (2018) and Xinhua & Gao (2020) identified cluster members using a Gaussian mixture model (GMM) clustering method.
Besides DBSCAN and its variants, the friends-of-friends (FoF) algorithm (Yang et al. 2008) has also been applied to identify OCs. Liu & Pang (2019) found 76 candidate star clusters in Gaia DR2 using the FoF algorithm. Li et al. (2022) also used the FoF algorithm to perform a blind search for OCs in Gaia EDR3 within 25 degrees of the Galactic plane. As a result, 61 new OCs were reported among the 868 candidates. The advantage of using the FoF algorithm to group stars is that the clustering considers a five-dimensional weighted parameter space of parallax, position, and velocity. Its disadvantage is that it is not sensitive to the size of the cluster radius or to the uneven distribution of stellar density, which changes with distance. The linking length $b_{\rm FoF}$ is the only hyperparameter of the FoF algorithm and significantly impacts the clustering results. Liu & Pang (2019) proposed a high-performance approach (i.e., SHiP) that calculates $b_{\rm FoF}$ in each data region from $N_{\rm star}$, the number of stars in that region, and it has been successful in finding many open clusters. However, the approach has some minor deficiencies. First, it is relatively ineffective in searching for member stars in sparse regions of star clusters. Second, the minimum number of clusters is set to a predetermined fixed value, e.g., 50, which may lead to some member stars being incorrectly included in a cluster during the merging process of the method.
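As a rough illustration of the FoF grouping described above, the following minimal sketch links any two stars whose separation in a (optionally weighted) parameter space is below a linking length and returns the resulting connected groups. The function name `fof_groups`, the `weights` and `min_members` arguments, and the use of SciPy are our own assumptions for illustration; this is not the SHiP implementation of Liu & Pang (2019), and the linking length is simply passed in rather than derived from the number of stars in a region.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial import cKDTree


def fof_groups(features, linking_length, weights=None, min_members=2):
    """Group points with a friends-of-friends rule.

    features: (n, d) array, e.g. position, parallax and proper motion columns.
    weights: optional per-dimension scale factors (hypothetical choice).
    Returns one integer label per point; -1 marks points that belong to
    groups with fewer than min_members points.
    """
    x = np.asarray(features, dtype=float)
    if weights is not None:
        x = x * np.asarray(weights, dtype=float)
    n = len(x)
    # Every pair closer than the linking length is a pair of "friends".
    pairs = cKDTree(x).query_pairs(r=linking_length, output_type="ndarray")
    graph = coo_matrix(
        (np.ones(len(pairs)), (pairs[:, 0], pairs[:, 1])), shape=(n, n)
    )
    # Friends of friends end up in the same connected component.
    _, labels = connected_components(graph, directed=False)
    counts = np.bincount(labels)
    return np.where(counts[labels] >= min_members, labels, -1)
```

Because the only tunable quantity here is `linking_length`, the sketch also makes visible why the choice of $b_{\rm FoF}$ dominates the outcome: a value that is too large merges neighbouring groups, while a value that is too small fragments sparse ones.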
In this study, we referred to the study of Li et al. (2022) to obtain a data set of open cluster candidates using the FoF algorithm. We then presented an improved hybrid algorithm to identify OCs from open cluster candidates found by FoF more robustly. The rest of the paper is organized as follows. We described the method in Section 2, including the membership determination method and identification model for OCs. The results are presented in Section 3, which includes sample handling, isochrone-fitting, cross-matching, and visual inspection. We discussed the results in Section 4. Finally, a conclusion is covered in Section 5.

AN IMPROVED IDENTIFICATION METHOD FOR OPEN CLUSTERS
In light of the previous literature mentioned in the first section, we realized that improving OC recognition accuracy with machine learning methods faces two problems. One is improving the quality of samples, and the other is optimizing the final OC identification model. Therefore, we presented a robust identification approach for OCs and showed the flow chart in Figure 1.

Member Star Determination Method
Determining which stars in a cluster candidate are member stars and which are field stars is a challenging task. We present a hybrid algorithm based on pyUPMASK (Pera et al. 2021) and random forest (RF) to eliminate false member stars (field stars) from the cluster candidates. We created a separate dataset for each cluster candidate. Each star in the dataset was labeled as a member star or a field star using pyUPMASK. An RF model was then trained on these labeled stars to predict more credible member stars.
1. pyUPMASK is a Python package implementing Unsupervised Photometric Membership Assignment in Stellar Clusters (UPMASK; Krone-Martins & Moitinho 2014), which is used to estimate the membership probability of each input star. pyUPMASK has been widely used to determine the member stars of OCs based on astrometric parameters (He et al. 2022b; Bai et al. 2022; Dias et al. 2022).
The essence of the UPMASK algorithm is to calculate the kernel density estimation (KDE) likelihood of the member stars of the OC candidates. The membership probability of each star is iteratively calculated by
$$P_{\rm star} = \frac{{\rm KDE}_m}{{\rm KDE}_m + {\rm KDE}_{nm}},$$
where $P_{\rm star}$, ${\rm KDE}_m$, and ${\rm KDE}_{nm}$ are the membership probability of the star, the KDE of the members, and the KDE of the field stars, respectively. In general, besides coordinates, pyUPMASK requires at least two dimensions of data of any type to estimate membership probabilities. For some distant open cluster candidates, we chose coordinates and proper motion as the input of pyUPMASK (excluding parallax and photometric data), which has been confirmed to be effective by Perren et al. (2022). To speed up the computing process, we performed parallel computing using mpi4py (Dalcin et al. 2008). It also allows us to process a volume of data that does not fit in the memory of a single machine.
Figure 1. Flowchart of the hybrid identification approach. The process consists of 3 stages. Stage 1 deals with rough FoF clustering within Gaia EDR3. Then, each OC's membership is identified in stage 2. After refining the member stars, we implement the recognition of OCs in stage 3, which includes a submodule for the OC identification model, isochrone fitting, and visual inspection.
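To make the ratio of member to field-star KDEs concrete, here is a minimal sketch that evaluates ${\rm KDE}_m / ({\rm KDE}_m + {\rm KDE}_{nm})$ for a given member/field split. It assumes SciPy's `gaussian_kde` with its default bandwidth and a single pass; the function name `kde_membership_probability` is hypothetical, and the actual pyUPMASK code iterates this step together with a clustering stage and may weight the two densities differently.

```python
import numpy as np
from scipy.stats import gaussian_kde


def kde_membership_probability(data, is_member):
    """Return KDE_m / (KDE_m + KDE_nm) evaluated at every star.

    data: (n, d) array of features (e.g. the two proper motion components).
    is_member: boolean array holding the current member / field split.
    """
    data = np.asarray(data, dtype=float)
    is_member = np.asarray(is_member, dtype=bool)
    # gaussian_kde expects arrays shaped (d, n).
    kde_m = gaussian_kde(data[is_member].T)
    kde_nm = gaussian_kde(data[~is_member].T)
    dens_m = kde_m(data.T)
    dens_nm = kde_nm(data.T)
    total = dens_m + dens_nm
    # Where both densities underflow to zero, return 0 instead of 0/0.
    return np.divide(dens_m, total, out=np.zeros_like(total), where=total > 0)
```

In an iterative scheme, the returned probabilities would be used to update `is_member` before the densities are re-estimated.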
After the pyUPMASK calculation, we followed Gao (2018) and set the membership probability threshold to 0.8. Stars with a membership probability greater than 0.8 are labeled member stars, and the others are labeled field stars, which differs from the practice of He et al. (2021).
2. Based on the labels and eight parameters extracted from the catalog, i.e., $l$, $b$, $\varpi$, $\mu_{\alpha*}$, $\mu_\delta$, $G$, $G_{RP}$, and $G_{BP}$, we built an RF model for each OC. We treated the identification of cluster members as a supervised binary classification problem. The RF model predicted the final member stars for each OC.
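The two steps above (thresholding the pyUPMASK probabilities, then training an RF on the resulting labels) can be sketched as follows. This is only an illustration under stated assumptions: it uses scikit-learn's `RandomForestClassifier`, the function name `refine_members` and the hyperparameters (`n_estimators=200`, `class_weight="balanced"`) are our own choices rather than values from the paper, and the model is evaluated on the same stars it was trained on.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def refine_members(features, membership_prob, threshold=0.8, seed=0):
    """Label stars by a probability threshold, then refit the labels with an RF.

    features: (n, 8) array, e.g. l, b, parallax, pmra, pmdec, G, G_RP, G_BP.
    membership_prob: membership probabilities from a previous step.
    Returns a boolean array that is True for predicted member stars.
    """
    features = np.asarray(features, dtype=float)
    labels = (np.asarray(membership_prob) > threshold).astype(int)
    if labels.min() == labels.max():
        # Only one class is present, so there is nothing to train on;
        # keep the thresholded labels unchanged.
        return labels.astype(bool)
    # class_weight="balanced" reweights the two classes to compensate for
    # having far fewer member stars than field stars (an assumed setting).
    model = RandomForestClassifier(
        n_estimators=200, class_weight="balanced", random_state=seed
    )
    model.fit(features, labels)
    return model.predict(features).astype(bool)
```

Treating the single-class case separately avoids fitting a classifier that has no field-star examples to learn from.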

Identification model for Open Cluster
We applied an RF classifier to detect the OCs among the potential candidates. The RF classifier was built with 1229 positive OC samples collected from Gaia DR2 (Cantat-Gaudin & Anders 2020) and 628 positive OC samples collected from Gaia EDR3 (Castro-Ginard et al. 2018; Castro-Ginard et al. 2019; Castro-Ginard et al. 2020; Castro-Ginard et al. 2022). The negative samples, equal in number to the positive samples, were synthesized from stars whose spatial distribution is assumed to be random and uniform. We finally obtained a training set of 3714 samples (1857 positive and 1857 negative) for modeling.
We calculated a confusion matrix (see Figure 2) to evaluate the model. The precision of the model is 99.35%.
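For orientation, the sketch below shows one way a uniformly distributed negative sample could be drawn and how precision is read off a binary confusion matrix. The helper names `synth_negative_sample` and `precision_from_confusion`, the choice of bounds, and the row/column convention (rows are true classes, columns are predicted classes, class 1 is "open cluster") are assumptions for illustration; the paper does not specify how its synthetic samples were generated beyond the uniform spatial distribution.

```python
import numpy as np


def synth_negative_sample(n_stars, bounds, rng):
    """Draw n_stars points uniformly inside per-dimension (low, high) bounds."""
    low, high = np.asarray(bounds, dtype=float).T
    return rng.uniform(low, high, size=(n_stars, len(low)))


def precision_from_confusion(cm):
    """Precision TP / (TP + FP) of class 1 from a 2 x 2 confusion matrix.

    cm[i, j] counts samples of true class i that were predicted as class j.
    """
    cm = np.asarray(cm, dtype=float)
    true_positive = cm[1, 1]
    false_positive = cm[0, 1]
    return true_positive / (true_positive + false_positive)
```

Precision only measures how many predicted clusters are real; it says nothing about missed clusters, which is why the full confusion matrix is the more informative diagnostic.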

Data Preparation For Open Cluster Candidates
We strictly followed the data preparation method of Li et al. (2022) to generate the OC candidate dataset. First, to exclude observational artifacts due to faintness in Gaia EDR3, we filtered the sources by stellar position, proper motion, and parallax ($\varpi$ between 0.2 mas and 0.7 mas, $G < 18$ mag, $\mu_\alpha \cos\delta < 30$ mas yr$^{-1}$, and $\mu_\delta < 30$ mas yr$^{-1}$). Meanwhile, considering that most OCs are concentrated near the Galactic disk, we set $|b| < 25$ degrees. Second, to make the search over the 180 million extracted sources tractable, we divided the entire search volume into many data regions according to Galactic longitude ($l$), Galactic latitude ($b$), and parallax ($\varpi$). The numbers of divisions in $\varpi$, $b$, and $l$ are 8, 8, and 64, respectively. To avoid splitting clusters across different regions as much as possible, each data region must not be smaller than twice the typical cluster size (20 pc) (Portegies Zwart et al. 2010). To handle potential clusters located at region boundaries, we set an overlap between adjacent regions (0.2 mas in $\varpi$ and 10 pc in $l$ and $b$). After carrying out the above scheme, the whole search volume is divided into 4091 data regions.
We applied FoF clustering to each data region to find local clusters and aggregated them to obtain 3597 candidate clusters. After cross-matching, we obtained 807 new cluster candidates. Using the membership determination RF model, we removed the field stars from each of these 807 candidates. We then classified the candidates using the identification RF model: 801 were classified as open clusters, and the other 6 were rejected.
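The selection cuts and the coarse partition described above can be sketched with NumPy as follows. Several details are assumptions made only for illustration: the proper motion cuts are applied to absolute values, the bins are of equal width in $l$, $b$, and $\varpi$, and the overlap between adjacent regions is ignored. A plain 64 x 8 x 8 grid has 4096 cells, so this simplified version will not reproduce the 4091 regions quoted in the text. The function names are hypothetical.

```python
import numpy as np


def select_sources(parallax, pmra, pmdec, g_mag, b):
    """Boolean mask for the quoted cuts (mas, mas/yr, mag, degrees)."""
    parallax, pmra, pmdec, g_mag, b = map(
        np.asarray, (parallax, pmra, pmdec, g_mag, b)
    )
    return (
        (parallax > 0.2) & (parallax < 0.7)
        & (g_mag < 18.0)
        # Assumption: the cuts apply to the absolute proper motions.
        & (np.abs(pmra) < 30.0) & (np.abs(pmdec) < 30.0)
        & (np.abs(b) < 25.0)
    )


def region_index(l, b, parallax, n_l=64, n_b=8, n_plx=8):
    """Map each source to one cell of an equal-width (l, b, parallax) grid."""
    l, b, parallax = map(np.asarray, (l, b, parallax))
    il = np.clip((l % 360.0) / 360.0 * n_l, 0, n_l - 1).astype(int)
    ib = np.clip((b + 25.0) / 50.0 * n_b, 0, n_b - 1).astype(int)
    ip = np.clip((parallax - 0.2) / 0.5 * n_plx, 0, n_plx - 1).astype(int)
    # Flatten the three bin numbers into a single region identifier.
    return (il * n_b + ib) * n_plx + ip
```

Sources near a cell boundary would additionally have to be copied into the neighbouring cell to mimic the overlap scheme, which this sketch leaves out.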

Validation Of Open Cluster Candidates
After obtaining 801 open cluster candidates, we cross-matched them with published open cluster catalogs. We consider an OC to be positionally matched to a cataloged one if their centers lie within a circle of radius r = 0.5 degrees and the remaining mean astrometric parameters are compatible within 5σ, which is consistent with Hunt & Reffert (2021) and Castro-Ginard et al. (2021). Here, σ denotes the uncertainties quoted in both catalogs for each quantity.
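The matching rule above can be written down as a short check. This is a sketch under assumptions: the two quoted uncertainties are combined in quadrature (the text only says that σ is taken from both catalogs), the compared quantities are parallax and the two proper motion components, and the dictionary keys and function names are hypothetical.

```python
import numpy as np


def angular_separation_deg(lon1, lat1, lon2, lat2):
    """Great-circle separation in degrees (haversine formula)."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    a = (
        np.sin((lat2 - lat1) / 2.0) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2
    )
    return np.degrees(2.0 * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0))))


def is_match(cand, known, max_sep_deg=0.5, n_sigma=5.0):
    """True if two cluster records agree in position and mean astrometry."""
    sep = angular_separation_deg(cand["l"], cand["b"], known["l"], known["b"])
    if sep > max_sep_deg:
        return False
    for key in ("parallax", "pmra", "pmdec"):
        # Assumption: combine both catalog uncertainties in quadrature.
        sigma = np.hypot(cand[key + "_err"], known[key + "_err"])
        if abs(cand[key] - known[key]) > n_sigma * sigma:
            return False
    return True
```

For catalogs without usable proper motion uncertainties, only the positional part of the check would apply.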
We gathered most of the known star cluster catalogues and labelled them MWSC, CG2017, Hao3794, UBC series, CWNU, Dias1743, and Hao704, respectively. The pre-Gaia cluster catalogs (MWSC) contain 3006 star clusters gathered by Dias et al. (2012) and Kharchenko et al. (2013). … (2021), Li et al. (2021). In particular, using the same method, we also cross-matched our candidates against the 46 star clusters recently reported by He et al. (2022b). Those 46 clusters are located in high-Galactic-latitude regions with |b| > 20 degrees, whereas ours are located at |b| < 25 degrees. Therefore, no match is expected between our clusters and any of those 46 clusters.
The MWSC catalogs contain 2976 cataloged objects within Galactic latitude |b| < 25 degrees, gathered from different data sources. Because they do not allow for sufficiently accurate comparisons in proper motion space, we only performed a 0.5 degree positional cross-match based on sky coordinates.
After cross-matching, 501 of the 801 candidates were already identified. We obtained 300 candidate clusters that have not been identified and reported, which is the data set for subsequent OC identification.

Color-Magnitude Diagrams fitting
The color-magnitude diagrams (CMDs) were fitted by two independent approaches, i.e., the fitting based on isochrone models (Bressan et al. 2012) and the fitting based on the advanced stellar population synthesis (ASPS) model (Li et al. 2016;Li et al. 2017). We first fitted CMDs by isochrone fitting method and then validated the results with ASPS.
The isochrone fitting method is mature and classic. In theory, member stars in an OC are born from the same gas cloud in a single episode of star formation. Most of them are therefore expected to follow a single isochrone in the CMD and to share the same metallicity and age. We used the isochrone-fitting method with the PARSEC theoretical isochrone models (Bressan et al. 2012), updated for the Gaia EDR3 passbands using the photometric calibrations from ESA/Gaia, to derive their physical parameters (age and metallicity). We applied a log-normal initial mass function (Chabrier 2003) to generate an isochrone library from $\log(t/{\rm yr}) = 6.0$ to 11.13 in steps of $\Delta(\log t) = 0.03$, with metal fractions from 0.002 to 0.042 in steps of 0.002. Table 1 presents the parameter ranges and steps for isochrone fitting. The objective fitting function
$$\bar{d^2} = \frac{1}{n}\sum_{k=1}^{n} \left| x_k - x_{k,nn} \right|^2$$
was applied to the 300 OC candidates, where $n$ is the number of selected members in a cluster candidate, and $x_k$ and $x_{k,nn}$ are the positions of the member stars and of the points on the isochrone that are closest to the member stars, respectively.
ASPS is a model that contains different kinds of stellar populations, including single-star simple stellar populations (ssSSP); binary-star simple stellar populations (bsSSP); single-star composite stellar populations (ssCSP); binary-star composite stellar populations (bsCSP); simple stellar populations of single, binary, and rotating stars (sbrSSP); and composite stellar populations of single, binary, and rotating stars (sbrCSP). ASPS can be used to fit the CMDs and determine the cluster properties. Due to the large number of parameters considered, fitting CMDs with ASPS takes longer.
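The isochrone-fitting objective described above, the mean squared distance between each member star and its nearest isochrone point in the CMD, can be sketched as follows. The nearest points are found with a k-d tree from SciPy; the function names and the dictionary-based "library" are hypothetical, and real isochrones would first have to be shifted by distance modulus and extinction, which this sketch omits.

```python
import numpy as np
from scipy.spatial import cKDTree


def mean_square_distance(cmd_points, isochrone_points):
    """Mean squared distance from each star to its nearest isochrone point.

    Both inputs are (n, 2) arrays of (color, magnitude) pairs.
    """
    tree = cKDTree(np.asarray(isochrone_points, dtype=float))
    dist, _ = tree.query(np.asarray(cmd_points, dtype=float), k=1)
    return float(np.mean(dist ** 2))


def best_isochrone(cmd_points, isochrone_library):
    """Return (key, score) of the library entry with the smallest score.

    isochrone_library maps a label, e.g. (log_age, metallicity), to an
    (m, 2) array of isochrone points.
    """
    scores = {
        key: mean_square_distance(cmd_points, iso)
        for key, iso in isochrone_library.items()
    }
    best = min(scores, key=scores.get)
    return best, scores[best]
```

Because color and magnitude span different ranges, a practical fit may rescale the two axes before computing distances; that choice is left open here.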

Manual Verification And Final Results
After isochrone fitting, we classified OC candidates based on their CMD morphology and isochrone fitting results to facilitate our manual visual inspection. Referring to some previous literature (Liu & Pang 2019;Castro-Ginard et al. 2020;He et al. 2022b,c), we applied the following parameters to the classification.
1. $n_{\rm star}$: the number of member stars with $G < 17$ mag.
2. $\bar{d^2}$: the average squared distance between the cluster stars and the isochrone, used to measure the isochrone-fitting error.
3. $r_n$: the narrowness of the MS in the CMD, calculated as $v_1/v_2$, where $v_1$ and $v_2$ are the two eigenvalues of the covariance matrix of the distribution of stars in the CMD.
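Two of the quantities listed above can be computed directly from the member list, as in the following sketch. It assumes that $r_n$ is the ratio of the smaller to the larger eigenvalue of the CMD covariance matrix, so that a narrow sequence gives a value close to zero; the function names are hypothetical.

```python
import numpy as np


def count_bright(g_mag, limit=17.0):
    """Number of member stars with G magnitude below the limit."""
    return int(np.sum(np.asarray(g_mag, dtype=float) < limit))


def cmd_narrowness(color, magnitude):
    """Ratio of the smaller to the larger eigenvalue of the CMD covariance.

    Values near 0 indicate a thin, elongated sequence; values near 1
    indicate a roughly round cloud of points.
    """
    cov = np.cov(np.vstack([color, magnitude]))
    eigenvalues = np.linalg.eigvalsh(cov)  # returned in ascending order
    return float(eigenvalues[0] / eigenvalues[1])
```

Note that the ratio depends on the units of the two CMD axes, so thresholds on $r_n$ are only comparable when the same color and magnitude definitions are used.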
We classified the 300 open cluster candidates into three classes: 1. Class A: $n_{\rm star} \geq 20$, $r_n < 0.1$, $\bar{d^2} < 0.05$; 2. Class B: $n_{\rm star} \geq 20$, $r_n < 0.$…
Among the candidates in Class A and Class B mentioned above, 46 were finally considered possible real OCs (see Table 1). A subset of these 46 candidates is shown in the accompanying figures. To further validate the results, we fitted the CMDs of these 46 OC candidates using the ASPS model. Comparing the ASPS-based fitting with the isochrone fitting, the results for 42 OCs are consistent (see Figures 3-6 in Appendix A). Four cluster candidates, with IDs 3512, 3526, 3567, and 3595, are not well fitted by the ASPS models. However, these four candidates give very reasonable results in isochrone fitting and also have a high probability of being clusters; Figure A.7 in the Appendix shows their isochrone fitting results. Figures 4 and 5 indicate that most new candidates are distributed between the Norma and Near arms. Figure 6 shows the distributions of age and metallicity of the newly identified OCs. We found that these OCs are younger than 3.0 Gyr (see Figure 6(b)). Additionally, most of them are metal-poor (see Figure 6(a)).

Approach Limitations
For most clusters, we run the pyUPMASK membership census only in the two-dimensional proper motion space rather than in a higher-dimensional feature space. This is because, as the distance of a star increases, the uncertainty of parameters such as parallax increases significantly, introducing more uncertainty into the census. During the member star census, not all clusters have their member stars identified by the RF model. In some cases, the membership probabilities after running pyUPMASK are already greater than the threshold we set. For such star clusters, we do not use the RF model. Instead, we use pyUPMASK for membership probability filtering in a five-dimensional ($l$, $b$, $\varpi$, $\mu_{\alpha*}$, $\mu_\delta$) space. In another case, if fewer than 10 member stars satisfy the probability threshold after the membership probability calculation, we consider the star cluster to be false and discard it.
One point that needs explanation is that, when training the RF model for member star recognition, we adopted the weighted RF algorithm. This is because the training labels produced by pyUPMASK are unbalanced.

Performance Analysis Of Member stars Determination
To validate the proposed hybrid method, we first tested it on M67 (NGC 2682), a well-studied open cluster whose members are publicly available in many studies (Castro-Ginard et al. 2020; Ghosh et al. 2022). We downloaded sources from Gaia EDR3 in a cone around the cluster center within a radius of 50 pc (hereafter, all sources). We then applied our method to determine the cluster members according to the steps in Figure 1. Jadhav et al. (2021) used combinations of astrometric, photometric, and systematic parameters to train a supervised machine learning algorithm, along with a Gaussian mixture model, to determine the cluster membership of M67 using the Ultra Violet Imaging Telescope (UVIT) aboard ASTROSAT and Gaia EDR3 (hereafter VV21). Since this is the latest representative research result, we mainly focus on it for the comparative analysis. Compared with the other three studies, i.e., 766 member stars found by FoF, 484 member stars in CG20, and 746 member stars in VV21, we obtained 1131 M67 member stars from all sources. Figure 10 shows that our results agree with VV21 and FoF for the most part.
We further tested the robustness of our method on smaller known clusters. We chose two such clusters: UBC1029 with 40 members (Castro-Ginard et al. 2022) and OC0033 with 47 members (Hao et al. 2022). From left to right, the panels of Figures 7, 8, and 9 show the spatial distribution in equatorial coordinates, the proper motion distribution, the CMD, and the parallax histogram, respectively. The mean and variance of each astrometric quantity (position, parallax, and proper motion) and the number of member stars are presented in Table 3. The results show that the proposed method can accurately identify both nearby massive clusters and smaller distant clusters. Compared with other OC studies using Gaia EDR3 data, the member stars we identified are significantly more concentrated: they have a tighter spatial distribution, a clearer isochrone feature, and a larger number of member stars.
Some discrepancies in the member star results are to be expected. Performance is poor for clusters younger than 50 Myr, which tend to be embedded in their star-forming regions. In addition, clusters at distances larger than 1.5 kpc have larger astrometric errors, which makes the member star analysis less reliable (Tarricq et al. 2022).

PDs, Parallax and proper motion dispersions
We compared the newly discovered cluster candidates with known ones based on CG20 (Cantat-Gaudin & Anders 2020). Figure 11 shows the distribution of the OC candidates proposed in this study, including the 46 newly identified OCs and over 500 matched OCs. The locations of the proposed candidates match those of the previously known OCs. The vast majority of the new OC candidates (all except one) are located at |b| < 15 degrees, and 95% of them lie within |b| < 10 degrees.
As shown in Figure 12, the parallaxes of our OC candidates are mostly in the range of 0.16 to 0.42 mas, which is consistent with the results of previous work. In addition, we compared the proper motion dispersions of our new OCs with those of CG20. Figure 13 shows that the OC candidates identified in this study have similarly small dispersions to the known OCs, which is characteristic of real clusters.
Figure 10. The spatial distribution of M67 cluster members from our method compared with the VV21 and FoF methods. It should be noted that VV21 appears to have a more comprehensive census. This is because VV21 uses the Ultraviolet Imaging Telescope (UVIT) on ASTROSAT together with Gaia EDR3 to determine a combination of astrometric, photometric, and systematic parameters of the M67 cluster members (Jadhav et al. 2021), while our work only used the astrometric and photometric data from Gaia EDR3. The Venn diagram of members is shown in the rightmost subplot. VV21 and ours have 460 identical member stars (over 61%), while FoF and VV21 have only 355 identical members (about 47.6%). Our approach thus increases the joint membership with VV21 by about 14 percentage points, suggesting that our method is valuable.
Figure 13. Total observed proper motion dispersion using Gaia EDR3 vs. parallax. The x-axis is on a log scale. Known OCs from the homogeneous OC catalog of CG20 (black dots) are compared with the new candidates in this study (red dots).

Classification And Result Analysis
Because of our strict classification criteria, the number of OCs in Class A and Class B is relatively low in the final classification results. Meanwhile, the fitting models used take more account of multiple-population effects such as binary stars, stellar rotation, and multiple starbursts. The parameter grid of the theoretical isochrones used in the fit is sparse, resulting in some potential star clusters not being fitted.
However, viewed another way, the FoF algorithm and our identification model yielded 801 candidates. Of these, 501 are cross-matched with known clusters and 46 are newly identified, which means that 68.29 percent (547/801) of these sources are successfully identified by our proposed method, which shows its value.

Discussion of CMD fitting
To validate our isochrone fitting method, we randomly selected 4 clusters from Bossini et al. (2019) that have a similar size to our 46 reported OCs. We fitted them with our isochrone-fitting method and with the method presented by Bossini et al. (2019), respectively. Figure 14 shows the fitted results, and the fitted parameters of the two methods are listed in Table 4. Overall, the fitting results of the two methods are consistent, and the differences between them are within acceptable limits. The two methods use different isochrone libraries, which may explain the differences: while Bossini et al. (2019) used an isochrone library based on Gaia DR2, our method uses an isochrone library updated for the Gaia EDR3 passbands using the photometric calibrations.
We further inspected the CMD fit results. We noticed that some parameters in the fitting results were not reasonable (e.g., the fitted age of ID1746 was only 4 dex). Following the method of He et al. (2022a), we selected the member stars with small errors, re-fitted them using the isochrone-fitting method, and obtained reasonable results.

Future works
We identified 46 reliable clusters among 300 OC candidates. However, we cannot conclude that the remaining 254 candidates are not open clusters. We can only say that the method proposed in this study cannot accurately identify these 254 candidates. We still suspect that there are open clusters among them, and other methods will be needed to find them in the future.
In addition, multi-view learning should be introduced in the future. We regard the data of a member star as consisting of three basic sub-views: the two-dimensional proper motion, the three-dimensional position, and the magnitude (photometric) sub-view. We completed the probability census in one sub-view; this can be expanded to multiple views whose complementary information is then integrated to improve the accuracy of member star identification.

CONCLUSIONS
In this study, we proposed a robust approach to identifying OCs. For the given OC sample data, a pyUPMASK and RF hybrid method is first used to remove field stars. Then an identification model based on the RF algorithm and Gaia EDR3 data is used to identify OC candidates. Finally, open cluster candidates are obtained after isochrone fitting and manual visual inspection. Based on the proposed approach, we obtained 46 new reliable open cluster candidates that have not been reported before, which supports the reasonableness of the proposed method.