A New Cosmic-Ray Rejection Routine for HST WFC3/UVIS via Label-free Training of deepCR

deepCR is a deep-learning-based cosmic-ray (CR) rejection framework originally presented by Zhang & Bloom. The original approach requires a dedicated training set consisting of multiple frames of the same fields, enabling automatic CR labeling through comparison with their median coadds. Here, we present a novel training approach that circumvents the need for a dedicated training set, instead utilizing dark frames together with the very science images requiring CR removal. During training, CRs present in dark frames are added to the science images, and the network is trained to identify them. The trained deepCR model can then be applied to identify the CRs originally present in the science images. Using this approach, we present a new deepCR model trained on a diverse set of Hubble Space Telescope images of resolved galaxies in the Local Group, which is universally applicable across all WFC3/UVIS filters. We also introduce a robust approach to determining the threshold for converting deepCR probability maps into binary CR masks. When applied to the Panchromatic Hubble Andromeda Southern Treasury survey, our new deepCR model increased the number of good-quality stars exhibiting distinct features in their color–magnitude diagrams by ∼7%.


Introduction
Highly sensitive detectors, such as charge-coupled devices (CCDs), used for imaging and spectroscopic observations are susceptible to high-energy particles known as cosmic rays (CRs). Interactions with CRs diminish detector sensitivity within the regions they strike. The impact of CRs is evident both in ground-based observations and in those conducted by space-based telescopes.
The frequent occurrence of CRs can give rise to conspicuous illuminated spots or elongated features in images, which may generate spurious source detections or distort the genuine characteristics of objects. In spectroscopy, CR impacts manifest as artificial spikes in spectra, posing challenges for the precise identification of real spectral signals. Failure to detect and rectify CR artifacts can lead to unreliable measurements in the affected pixels or detector regions, ultimately resulting in imprecise inference of the physical parameters of the observed objects. Space-based instruments, such as the Hubble Space Telescope (HST), are notably susceptible to these charged particles due to the lack of Earth's atmospheric shielding.
Various image processing techniques are employed to identify and replace CR-affected pixels. Taking multiple exposures of the same area, or multiple nondestructive readouts during a single exposure (e.g., Fixsen et al. 2000), is a common way to identify these artifacts through image comparisons (e.g., Zhang 1995). However, many circumstances require methods that can identify CRs in single exposures. In time-domain observations, such as those of variable stars or transient events, CRs can induce apparent abrupt changes in luminosity or spectra. Additionally, pixels can be impacted by CRs across multiple exposures, and such affected pixels might not be flagged after combining individual images.
The techniques used to combine multiple exposures for CR rejection have achieved a significant level of refinement (e.g., Windhorst et al. 1994; Freudling 1995; Fruchter & Hook 1997; Gruen et al. 2014; Desai et al. 2016), particularly those developed for dithered HST data. The original single-frame images, as initially downloaded from the Mikulski Archive for Space Telescopes (MAST), have been calibrated by the MAST data reduction pipeline, in which CR rejection has been carried out and flagged in the data quality extension. However, the accuracy of the pipeline CR flags should be inspected, given the varying observing strategies of different programs. For example, certain extensive treasury programs, such as the Panchromatic Hubble Andromeda Treasury (PHAT; Dalcanton et al. 2012), the PHAT Triangulum Extended Region (PHATTER; Williams et al. 2021), and the Panchromatic Hubble Andromeda Southern Treasury (PHAST; Z. Chen et al. 2024, in preparation), have only two exposures per band in UVIS and a notably wide range of exposure times, making it challenging for the MAST pipeline to reliably identify CRs.
As a result of these challenges, the PHAT and PHATTER surveys instead carried out their own CR flagging using the AstroDrizzle function from the Drizzlepac package (STSCI Development Team 2012; Hack et al. 2013; Avila et al. 2015). AstroDrizzle detects CRs by identifying pixels that exhibit brightness in one exposure but not in the other exposures in the same band. While this legacy method typically performs well in optical images, where a large number of detectable and alignable sources facilitates CR identification, it encounters challenges in UV images, where sources are usually sparse. In these scenarios, the potential for CR contamination increases significantly. Despite undergoing CR flagging via AstroDrizzle, the resulting images still retain a notable number of CRs, particularly in regions of reduced exposure coverage in a given dither pattern, such as the chip gap. The misidentification of CR artifacts as real sources can lead to difficulties in aligning objects across images, while this misalignment can, in turn, cause real sources to be mislabeled as CRs due to their inconsistent locations between images.
Specialized algorithms to identify and eliminate CR effects in single images or spectra include linear filtering (Rhoads 2000), median filtering (e.g., the IRAF tasks QZAP and XZAP by M. Dickinson), Laplacian edge detection (LACosmic; van Dokkum 2001), image data histogram analysis (Pych 2004), and machine-learning techniques using neural networks (e.g., Murtagh & Adorf 1991) or decision trees (Salzberg et al. 1995), among others. Farage & Pimbblet (2005) assessed these CR rejection algorithms and reported that the widely recognized LACosmic algorithm demonstrates the best overall performance in terms of detection and cleaning.
More recently, Zhang & Bloom (2020) introduced the first deep-learning-based CR identification and replacement framework, deepCR. This CR-rejection software package is more robust than all established methods, including LACosmic, while maintaining comparable processing speed. However, the original deepCR training approach requires a dedicated training set consisting of multiple frames of the same fields, enabling automatic CR labeling through comparison with their median coadds. Currently, deepCR only has trained models for HST Advanced Camera for Surveys (ACS)/WFC data (Kwon et al. 2021), and this training requirement substantially hinders the development of deepCR models for new instruments. There is significant demand within the community for a global deepCR model applicable to images observed with other instruments.
In this paper, we introduce a novel approach using CRs labeled directly from dark frames, eliminating the need for CR-free science images during training. With this approach, we present an update to deepCR that is universally applicable across all HST WFC3/UVIS filters. We describe the model architecture and the novel training approach in Section 2 and the construction of the data set in Section 3. In Section 4, we evaluate the performance of our global model on the PHAST survey and present how users can robustly determine the threshold for generating binary CR masks.

Model Architecture and Training Procedures
The deepCR framework predicts probabilistic maps of CR-affected pixels in input images, which are then transformed into binary CR masks using a user-defined threshold (see Section 4.2 for a discussion of how users can robustly determine such a threshold). The architecture used in this study is a refined UNet model (Ronneberger et al. 2015), a convolutional neural network with an encoder-decoder framework. Skip connections facilitate a more effective combination of high-level features (the location and properties of CR artifacts) and low-level features (exact CR boundaries). Specifically, the deepCR modification preserves the full dimensions of the input data throughout the prediction process, thereby retaining information at the edges of the images that would typically be discarded in a traditional UNet implementation.
The training data for the neural network consist of CR-affected images paired with corresponding ground-truth CR masks. In this study, we introduce a novel approach for constructing the deepCR training data set. The original method, as used in Zhang & Bloom (2020), necessitates a dedicated training set for each target filter, consisting of multiple exposures of the same fields that enable CR labeling on science images through comparison with their median coadds using AstroDrizzle (Hack et al. 2013). However, this approach considerably complicates the development of deepCR models for new instruments.
Here, we introduce a new method that utilizes the CRs present in calibrated dark frames, where automatic CR labeling is achieved directly through sigma clipping. For each dark frame, we apply iterative 5σ clipping to calculate its mean and variance, and then apply a 3σ cutoff to generate the binary CR label. During the training process, these original dark frames are added to the science images, while the CR masks labeled from the dark frames serve as the training targets.
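The dark-frame labeling step can be sketched as follows. This is an illustrative NumPy reimplementation, not the code used in this work; the function name and keyword parameters are our own, chosen to mirror the 5σ clipping and 3σ cutoff described above.

```python
import numpy as np

def label_dark_crs(dark, n_iter=5, clip_sigma=5.0, label_sigma=3.0):
    """Label CR pixels in a calibrated dark frame via iterative sigma clipping.

    Iteratively estimate the mean/std of the dark background with
    `clip_sigma` clipping, then flag pixels above `label_sigma` as CRs.
    """
    good = np.ones_like(dark, dtype=bool)
    for _ in range(n_iter):
        mean, std = dark[good].mean(), dark[good].std()
        good = np.abs(dark - mean) < clip_sigma * std
    # Recompute statistics on the final clipped sample
    mean, std = dark[good].mean(), dark[good].std()
    # Binary CR label: pixels more than label_sigma above the clipped mean
    return dark > mean + label_sigma * std

# Example: synthetic dark frame with Gaussian noise plus two bright CR hits
rng = np.random.default_rng(0)
dark = rng.normal(0.0, 1.0, (64, 64))
dark[10, 10] = dark[30, 40] = 50.0   # injected "cosmic rays"
mask = label_dark_crs(dark)          # both injected pixels are flagged
```

Note that the 3σ cutoff also flags a small fraction of noise pixels (the one-sided Gaussian tail, ∼0.1%); in practice this is part of the labeling noise discussed below.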
Two important aspects should be emphasized. First, the science images used for training are not required to be CR-free, because in the composite image (science + dark) the CRs originating from the science image and from the dark frame are identically distributed and indistinguishable from each other. For example, assume that half of the CRs in a composite training image came from the dark frame and are correctly labeled as such. Since the model is technically trained to identify dark-frame CRs, which account for 50% of all CRs and are indistinguishable from science-frame CRs, a perfectly trained model would predict a probability of only 50% for true CR pixels at test time. Alternatively, if the science images used for training were free of CRs, an optimal model would predict a probability of 100%. In conclusion, the labeling noise arising from science-frame CRs does not degrade the final model performance but shifts the distribution of output CR probabilities, an aspect we address further in Section 4.2.
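The probability-dilution argument above can be illustrated with a toy simulation (our own construction, not part of the deepCR code): when only dark-frame CRs carry labels, a perfectly calibrated model's output for CR-like pixels is capped at the labeled fraction rather than 100%.

```python
import numpy as np

# Toy model: each CR pixel in a composite training image came from the
# dark frame with probability 0.5; only those pixels are labeled as CRs.
rng = np.random.default_rng(1)
n_cr = 10_000
from_dark = rng.random(n_cr) < 0.5      # which CRs came from the dark frame
labels = from_dark.astype(float)        # only dark-frame CRs are labeled 1

# Since dark- and science-frame CRs are indistinguishable, the best a
# calibrated model can do for CR-like pixels is predict the labeled fraction.
optimal_prediction = labels.mean()      # ~0.5, not 1.0
```

The output probabilities are shifted downward uniformly, so the ranking of pixels (and hence the achievable binary mask, given an adjusted threshold) is unaffected.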
Second, if one only requires CR removal for a fixed set of science images, it is sufficient to use those science images themselves as the training set. In fact, including more science images in training does not improve the model performance on that set of images; it only improves the ability of the trained model to generalize to science images unseen during training. Here, it is useful to consider the CR pixels and the non-CR pixels in the training set separately. Limiting training to a small set of science images encourages overfitting on the non-CR part of the data set; however, since our application data set is the same as our training set, such overfitting is irrelevant. In contrast, including more dark frames may improve the final model performance, as dark-frame CRs serve as the labeled data.
Apart from proposing a new approach to constructing the training set, we also modify the deepCR architecture by replacing batch normalization with weight normalization. Normalization techniques are frequently used in neural network training to facilitate faster convergence; this is particularly crucial when dealing with astronomical imaging data, which often have a large dynamic range, to stabilize training and expedite convergence. In their original work, Zhang & Bloom (2020) employed batch normalization: the network first undergoes 20 training epochs in "training mode," during which batch normalization normalizes layer activations using the statistics of each training batch while accumulating running statistics, and is then trained for additional epochs with the batch normalization statistics frozen. By adopting weight normalization, we circumvent the need for this two-stage training scheme, enabling convergence in 20 epochs. Moreover, batch normalization relies on statistics computed over mini-batches of data, which can introduce variability in the training process and makes it sensitive to batch size. In contrast, weight normalization normalizes the network weights directly, making it less dependent on batch size.

Data
We have constructed our training and testing data sets from HST WFC3/UVIS imaging data of Local Group galaxies, in which the stellar populations are well resolved. The density of astronomical sources, which can serve as sources of confusion during CR identification, is the highest among the three categories established in Zhang & Bloom (2020). Table 1 summarizes the imaging data used for training. The data sets cover a wide range of filters, from F225W to F814W. For each filter, there are 28 single-frame exposures (*flc.fits) spanning a diverse set of exposure times. Table 2 provides the ID number and exposure time of each frame used in the data set. We train simultaneously on data from all of these filters, and subsequently evaluate the performance of the overall model separately for each filter.
We have also incorporated 15 calibrated dark frames (exposure time = 900 s) from the HST WFC3/UVIS instrument (PI: Benjamin Kuhn); a summary is given in Table 1. Individual dark frames underwent preprocessing by CALWF3, including standard bias correction and postflash correction (Ryan et al. 2016; Martlin & Green 2023). These dark frames contain a substantial number of CRs, alongside other forms of noise and inherent detector artifacts. We employ a two-step iterative sigma clipping on all dark frames to filter out noise and outliers, generating a ground-truth CR mask for each dark frame. During training, the dark frames are added to science images, with the CR labels serving as the training targets (see Section 2 for detailed training procedures). Finally, we divide the images and masks into stamps of 256 by 256 pixels to facilitate batch training. Each exposure frame yields 128 image stamps, so our overall training set consists of 17,920 image stamps.
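The stamp-cutting step can be sketched as below. This is an illustrative helper of our own (not from the deepCR code), which simply tiles an image into non-overlapping 256 × 256 stamps; a single WFC3/UVIS chip spans 2051 × 4096 pixels, which yields the 128 stamps per frame quoted above once the 3-row remainder is dropped.

```python
import numpy as np

def make_stamps(image, size=256):
    """Cut an image into non-overlapping size x size stamps.
    Edge remainders smaller than `size` are dropped for simplicity."""
    ny, nx = image.shape
    stamps = [
        image[y:y + size, x:x + size]
        for y in range(0, ny - size + 1, size)
        for x in range(0, nx - size + 1, size)
    ]
    return np.stack(stamps)

# One WFC3/UVIS chip (2051 x 4096 pixels) -> 8 x 16 = 128 stamps
stamps = make_stamps(np.zeros((2051, 4096)))
```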
All the HST data presented in this paper were obtained from the Mikulski Archive for Space Telescopes (MAST) at the Space Telescope Science Institute.The observations can be accessed via DOI:10.17909/xpmc-7m26.

Results
Using the novel training approach introduced in this work, we present an update to the CR rejection framework deepCR that is universally applicable across all HST WFC3/UVIS images. This global model is trained on a diverse set of HST data of resolved galaxies in the Local Group, covering a comprehensive range of WFC3/UVIS filters from F225W to F814W. Visual examples of CR rejection on a sample image of M31 with resolved stars can be found in Figure 1. For detailed evaluations of network performance compared to the baseline LACosmic model, refer to Zhang & Bloom (2020). In this section, we perform a cross-comparison with the legacy method of eliminating CR artifacts when multiple exposures of the same field are available. Our findings demonstrate that deepCR masking prominently enhances catalog quality beyond the legacy method. We also provide insights into robustly determining the threshold for generating binary CR masks from deepCR probability maps.

Performances on the PHAST Survey
We utilized the PHAST survey catalog (Z. Chen et al. 2024, in preparation) to assess the accuracy of the CR labeling determined by our newly trained deepCR model in comparison with the CRs identified using the legacy method with multiple exposures.
The legacy method used for identifying and cleaning CRs in many treasury programs, such as PHAT and PHATTER, relies on the AstroDrizzle function from the Drizzlepac package (STSCI Development Team 2012; Hack et al. 2013; Avila et al. 2015). AstroDrizzle detects CRs by identifying pixels that appear bright in one exposure but not in other exposures within the same band. These CRs are flagged in the data quality extensions of the *flt/flc.fits images, and the flagged regions are subsequently masked out when performing point-spread function fitting photometry with the DOLPHOT software package (Dolphin 2000, 2016). Additional details can be found in Williams et al. (2014, 2021). While the legacy method usually performs well in optical images, where a sufficient number of sources are detected and aligned, making CR identification easier, it faces challenges in UV images with typically sparse sources. In such cases, susceptibility to CR contamination is heightened. Despite undergoing the legacy CR flagging and cleaning process via AstroDrizzle, the resulting images still contain a notable number of CRs, particularly in chip-gap regions, which are often covered by only a single exposure due to dithering. In these scenarios, CR artifacts can easily be mistaken for sources, leading to poor matching and alignment.
The PHAST survey (Z. Chen et al. 2024, in preparation) adopted an observing and photometry strategy similar to those of PHAT and PHATTER, with one of the most significant changes being the preprocessing of all UV images with CR flagging by deepCR. This step minimizes CR contamination and significantly improves cross-band source alignment. The newly constructed PHAST catalog will include resolved stellar photometry for more than 100 million stars in M31, in both the optical (F475W, F814W) and near-UV (F275W, F336W) bands. The availability of multiple exposures of the same field in this survey makes it straightforward to identify good measurements of real stars at each wavelength. Consequently, we can compare the results and impacts of CR identification using the legacy method, previously applied to nearly all HST data to which it is applicable, with those obtained using our newly developed deepCR, which offers greater robustness in CR flagging.
For the PHAST survey, we adopted the same DOLPHOT criteria as PHAT and PHATTER for identifying good measurements of real stars (referred to as "GST"). A complete description of these criteria can be found in Williams et al. (2014, 2021) and Z. Chen et al. (2024, in preparation). The GST catalog sets specific requirements, including a signal-to-noise ratio (S/N) greater than 4 for all cameras. It further requires sharpness² < 0.2 and crowding < 2.25 for ACS, and sharpness² < 0.15 and crowding < 1.3 for UVIS. Each band is independently assessed using the GST criteria, resulting in a GST flag column indicating whether a star meets the criteria. We then use the GST catalog to compare the effects of CR flagging between the legacy mode and the deepCR method.
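The UVIS GST cuts above amount to a simple vectorized selection; the sketch below applies them to hypothetical per-star DOLPHOT output columns (the array names are illustrative, not DOLPHOT's actual column names).

```python
import numpy as np

def gst_flag_uvis(snr, sharpness, crowding):
    """UVIS GST selection: S/N > 4, sharpness^2 < 0.15, crowding < 1.3."""
    return (snr > 4) & (sharpness**2 < 0.15) & (crowding < 1.3)

# Three hypothetical stars: only the first passes all three cuts
snr       = np.array([10.0, 3.0, 8.0])
sharpness = np.array([0.1, 0.0, 0.5])
crowding  = np.array([0.5, 0.2, 0.4])
flag = gst_flag_uvis(snr, sharpness, crowding)   # -> [True, False, False]
```

The ACS cuts differ only in the constants (sharpness² < 0.2, crowding < 2.25).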
For a given input image, deepCR generates a probability map of CRs, which is then converted into a binary CR mask using a user-defined threshold (1 for CRs and 0 for non-artifacts). We assessed the performance of deepCR in flagging CRs under different threshold settings. A higher deepCR threshold represents a more conservative approach, masking out only the CRs with the highest predicted probabilities; this can leave some CR artifacts affecting real sources. Conversely, a lower threshold masks out more CRs, at the risk of accidentally masking real sources. Choosing an appropriate threshold is therefore crucial for confidently generating the binary CR masks that are subsequently flagged in the data quality extension of the input image. Images preflagged with various deepCR masking thresholds then underwent the same DOLPHOT photometry routine as the original data, and we compared the measurements of stars that passed the GST quality checks in the F275W and F336W UV bands.
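The probability-to-mask conversion is a single thresholding step; a minimal sketch (our own helper, not the deepCR API):

```python
import numpy as np

def prob_to_mask(prob_map, threshold=0.1):
    """Convert a CR probability map into a binary CR mask
    (1 = CR artifact, 0 = non-artifact)."""
    return (prob_map > threshold).astype(np.uint8)

prob = np.array([[0.02, 0.95],
                 [0.08, 0.30]])
mask = prob_to_mask(prob)   # -> [[0, 1], [0, 1]]
```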
The extent to which CRs are flagged ultimately impacts the identification and alignment of real sources. The deepCR masking process yields newly detected GST sources that were not classified as such in the legacy mode; however, it may also exclude sources that were originally part of the legacy GST catalog but are no longer included after the additional CR masking step. In comparison to the legacy mode, deepCR masking generally reduces the total number of stars in the final GST catalog by 1%-8%, depending on the applied threshold. We recorded the number of stars in the resulting GST catalog as a function of the deepCR masking threshold (Figure 2, upper panels). The number of GST stars drops significantly below a threshold of 0.1, indicating a potentially excessive cutoff at which real stars are mistakenly masked. In addition to the total number of stars in the GST catalog, we also compared GST stars that exhibit prominent features in their color-magnitude diagrams (CMDs). We specifically focused on stars with F336W magnitudes in the range [20, 25] mag and with a color index (F275W - F336W) within ±0.3. Figure 2 (lower left) displays an example CMD in F275W and F336W and marks the featured region. Figure 2 (lower right) shows the fraction of GST stars within the featured region that are newly identified after deepCR masking, compared with the fraction lost after masking, for different deepCR masking thresholds. We continue to gain GST stars within the CMD featured region as the threshold decreases, down to 0.08; however, the number of stars lost in the CMD featured region starts to increase rapidly below a threshold of 0.1. We further verified the performance of these cuts on different regions of the survey and at different wavelengths. In light of all considerations, a deepCR masking threshold of 0.1 proves to be the most robust choice for the PHAST survey, effectively flagging CRs while minimizing the risk of mistakenly removing real sources. Applying this threshold to the PHAST survey added approximately 7% more good-quality stars that exhibit distinct CMD features.

Universal CR Masking Threshold Selection
In the absence of unlabeled CRs in the science images during training, a perfectly trained model would predict a probability of 100% for CR pixels at test time, and a probability threshold of 50% on the output probability map would produce a balanced binary mask. However, the presence of labeling noise (Section 2) lowers the optimal output probabilities and thus the optimal CR threshold.
The left panel of Figure 3 presents a one-dimensional CR probability distribution from one HST visit as an example. The CR probability map exhibits a bimodal distribution, with one peak representing CRs, which have high predicted CR probabilities, and the other corresponding to non-artifacts (background and real sources), whose predicted CR probabilities are much lower, approaching zero. The threshold value at the minimum point between the two distributions is taken as the cutoff for CR masking (see Figure 3). Pixels with predicted CR probabilities exceeding this threshold are labeled 1 (CR artifacts), while the remainder are labeled 0 (non-artifact regions). We examined the CR probability distribution for each UVIS filter; the bimodal distribution and the position of the minimum cutoff for CR masking remain consistent across all filters. Within the probability valley, we select a relatively high value of 0.1 as the classification threshold, which minimizes false positives. To evaluate the false-positive rate, we consider CR-labeled pixels that exceed a 5σ brightness level in all observed exposures, assuming that real stars consistently surpass this level in all filters. With a 0.1 CR masking threshold, only <0.1% of the pixels flagged as containing CRs are potentially contaminated by bona fide stars. This estimate represents an upper limit on the false-positive rate, which will decrease as additional exposures and filters further refine the identification of real stars.
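Locating the valley of the bimodal probability distribution can be sketched as below; this is an illustrative NumPy implementation of the histogram-minimum idea, with bin counts and guard limits of our own choosing, not the exact procedure used in this work.

```python
import numpy as np

def valley_threshold(prob_map, bins=100, lo=0.01, hi=0.99):
    """Estimate a CR masking threshold as the minimum of the 1D
    histogram of predicted probabilities between its two peaks."""
    counts, edges = np.histogram(prob_map.ravel(), bins=bins, range=(0.0, 1.0))
    centers = 0.5 * (edges[:-1] + edges[1:])
    valley = (centers > lo) & (centers < hi)       # exclude the two peak bins
    return centers[valley][np.argmin(counts[valley])]

# Synthetic bimodal probability map: many near-zero non-artifact pixels,
# a small population of CR pixels piled up near one
rng = np.random.default_rng(2)
prob = np.concatenate([
    rng.beta(1, 30, 100_000),   # non-artifacts, clustered near 0
    rng.beta(30, 1, 2_000),     # CRs, clustered near 1
])
thr = valley_threshold(prob)    # falls in the sparsely populated valley
```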
The right panels of Figure 3 provide visual examples of CR identification with this masking threshold applied; the masked pixel values are filled with a 3 × 3 median filter for display. The general minimum cutoff threshold, as determined from the 1D CR probability distribution alone, aligns well with the specific masking threshold determined using catalogs of real sources such as PHAST. For future applications of our trained model, users have the flexibility to set different thresholds for CR flagging based on their specific scientific goals. For example, users can opt for a more conservative threshold and apply a dilation with a larger pixel radius to enhance cleaning performance.
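The mask-then-fill step (with optional dilation, as suggested above) can be sketched with SciPy; this is an illustrative helper of our own, not deepCR's built-in inpainting.

```python
import numpy as np
from scipy import ndimage

def clean_image(image, cr_mask, dilate_iters=0):
    """Replace CR-flagged pixels with 3x3 median-filtered values,
    optionally dilating the binary mask first."""
    mask = cr_mask.astype(bool)
    if dilate_iters > 0:
        mask = ndimage.binary_dilation(mask, iterations=dilate_iters)
    filled = image.copy()
    # Fill only the flagged pixels from a 3x3 median-filtered copy
    filled[mask] = ndimage.median_filter(image, size=3)[mask]
    return filled

img = np.ones((5, 5))
img[2, 2] = 100.0                      # a CR hit on a flat field
mask = np.zeros((5, 5), dtype=np.uint8)
mask[2, 2] = 1
cleaned = clean_image(img, mask)       # the hit is replaced by the local median
```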

Summary
We introduce a novel deep-learning-based CR rejection routine for HST that is applicable across all WFC3/UVIS filters. Unlike the original approach presented in Zhang & Bloom (2020), our new training approach does not require a dedicated training set; instead, the science frames in need of CR removal serve directly as the training set. We expect that our training methodology will be useful for developing deepCR models for new instruments, particularly ground-based ones.
To facilitate future applications, we have made our trained WFC3/UVIS model publicly available as part of the deepCR "model zoo." In comparison to the traditional routines for CR artifact removal commonly employed on HST data, our approach represents a significant advancement, especially for sparse UV images. It further optimizes cross-wavelength and cross-telescope alignments, which are crucial for panchromatic survey catalogs. Applying it to the PHAST survey added approximately 7% more good-quality stars that exhibit distinct CMD features. Lastly, we outline how users can confidently set the masking threshold for generating binary CR masks from the deepCR probability maps, tailored to different scientific needs.

Figure 1 .
Figure 1. Top left: WFC3/UVIS calibrated dark frame (HST:16568). Top right: WFC3/UVIS image in F336W of M31 (HST:16801). Bottom left: the result of adding the dark frame to the raw image. Bottom right: the same image with all CRs, from both the raw image and the dark frame, removed by deepCR. Masked pixel values are filled with a 3 × 3 median filter.

Figure 2 .
Figure 2. Top left: fraction of the total number of stars in the GST catalog from one example HST visit, for different deepCR masking thresholds, relative to the legacy mode. The resulting number of GST stars drops quickly below the threshold of 0.1. Top right: fractions of GST stars that are added or lost at different deepCR masking thresholds, which together construct the overall curve in the left panel. Bottom left: UV color-magnitude diagram of this example HST visit, with F275W and F336W measurements that pass our quality cuts in both bands as GST stars. The red box marks the CMD features of most importance. Bottom right: fractions of GST stars that are added or lost within the CMD featured box as a function of deepCR masking threshold. Dashed lines represent the results when the featured box is dithered to avoid selection edge effects. deepCR masking keeps adding GST stars within the CMD featured region until the threshold reaches roughly 0.08, while it starts to lose featured stars prominently below a threshold of 0.1.

Figure 3 .
Figure 3. Left: example of a one-dimensional CR probability distribution from the deepCR-predicted CR probability map. This bimodal distribution has one peak representing CRs, with high predicted CR probabilities, and another corresponding to non-artifacts (real sources and background), with much lower predicted CR probabilities, approaching zero. The threshold value (dashed line) at the minimum point between these two distributions is taken as the cutoff for CR masking. Pixels with predicted CR probabilities exceeding this threshold are labeled 1 (CR artifacts), while the rest are labeled 0 (non-artifact regions). The bimodal distribution and the position of the minimum cutoff threshold are independent of filter. Right: three rows present three examples of detailed CR artifacts alongside real sources. The left column displays the raw images, the middle column shows the binary CR masks identified by deepCR using the selected masking threshold, and the right column shows the cleaned images with artifact pixels filled by a 3 × 3 median filter.

Table 1
Hubble WFC3/UVIS Observations Used in Construction of the Data Set. The full table is available online. All the HST data presented in this paper were obtained from MAST at the Space Telescope Science Institute and can be accessed via DOI: 10.17909/xpmc-7m26.
(This table is available in its entirety in machine-readable form.)