CSST Strong Lensing Preparation: a Framework for Detecting Strong Lenses in the Multi-color Imaging Survey by the China Survey Space Telescope (CSST)

Strong gravitational lensing is a powerful tool for investigating the properties of dark matter and dark energy. With the advent of large-scale sky surveys, we can discover strong lensing systems on an unprecedented scale, which requires efficient tools to extract them from billions of astronomical objects. The existing mainstream lens-finding tools are based on machine learning algorithms and are applied to cutouts centered on galaxies. However, according to the design and survey strategy of optical surveys by the CSST, preparing cutouts with multiple bands requires considerable effort. To overcome these challenges, we have developed a framework based on a hierarchical visual transformer with a sliding-window technique to search for strong lensing systems within entire images. Moreover, given that multi-color images of strong lensing systems can provide insights into their physical characteristics, our framework is specifically crafted to identify strong lensing systems in images with any number of channels. As evaluated using CSST mock data based on a semi-analytic model named CosmoDC2, our framework achieves precision and recall rates of 0.98 and 0.90, respectively. To evaluate the effectiveness of our method on real observations, we have applied it to a subset of images from the DESI Legacy Imaging Surveys and to public media images from the Euclid Early Release Observations. 61 new strong lensing system candidates were discovered by our method. However, we also identified false positives arising primarily from the simplified galaxy morphology assumptions within the simulation. This underscores the practical limitations of our approach while also highlighting potential avenues for future improvement.

arXiv:2404.01780v1 [astro-ph.IM] 2 Apr 2024
robinmartin20@gmail.com, nan.li@nao.cas.cn

1. INTRODUCTION

Strong gravitational lensing represents a significant category of celestial phenomena, occurring when light emitted by distant cosmic entities traverses regions occupied by supermassive objects such as galaxies, galaxy clusters, or black holes (Birrer et al. 2022; Meneghetti et al. 2013; Treu 2010; Shajib et al. 2022; Treu et al. 2022; Kneib & Natarajan 2011; Vegetti et al. 2023). Strong lensing systems play a vital role in measuring mass distributions and constraining cosmological parameters such as the Hubble constant and the dark energy density, and they afford scientists the opportunity to investigate the properties of remote celestial objects (Bradač et al. 2002; Treu & Koopmans 2002; Auger et al. 2010; Sonnenfeld et al. 2015). They furnish invaluable insights into the early universe and the evolution of the cosmos.
Currently, our knowledge of strong lensing systems is limited to a few hundred instances. However, with ongoing and forthcoming surveys (Oguri & Marshall 2010; Jacobs et al. 2017), such as the Legacy Survey of Space and Time (LSST) (Ivezić et al. 2019; Jacobs et al. 2017), Euclid (Laureijs et al. 2011), and the China Space Station Telescope (CSST) (Zhan 2021), we anticipate a substantial increase in the number of identified strong lensing systems. Detecting them in future surveys presents a formidable challenge. The intricate shapes in images, including arcs and multiple images, often stem from the complex interplay of strong lensing systems with the internal structures of galaxies and their surrounding environments. Furthermore, additional sources of interference, such as the internal structure of galaxies, background noise, and effects introduced by the point spread function (PSF), add to the complexity, rendering the detection of strong lensing systems an even more daunting endeavor.
As the volume of observation data continues to expand, the manual approach of visually inspecting strong lensing systems has proven insufficient. Therefore, the adoption of appropriate automated detection algorithms becomes imperative. Previous automated detection methods have adopted various strategies. These include searching for distinctive features of strong lensing systems, such as arcs and rings, fitting geometric parameters to quantify the extent of these features, and searching for blue residuals in images after subtracting the lensing galaxy (Webster et al. 1988; Bacon et al. 2000; Courbin et al. 2000; Smith et al. 2001; Lenzen et al. 2004; Alard 2006; Estrada et al. 2007; Seidel & Bartelmann 2007; More et al. 2012; Gavazzi et al. 2014; Jacobs et al. 2017). Another approach has involved generating models of potential lensing galaxies and comparing them with real observation data to pinpoint potential lensing occurrences (Brault & Gavazzi 2015; Jacobs et al. 2017). Certain methods have involved the removal of galaxies from observation images (Chan et al. 2015; More et al. 2016; Jacobs et al. 2017), while others have analyzed the colors and shapes of lensed quasars, employing parameterization techniques for subsequent analysis (Brault & Gavazzi 2015; Jacobs et al. 2017). The aforementioned approaches have indeed discovered a notable number of strong lensing system candidates. However, these methods necessitate the manual identification of candidates alongside intricate data analysis. Due to the limitations of modern monitors, which can only display images with a finite range of grayscale levels and channels, many of the physical properties of strong lensing systems encoded in variations of grayscale or color cannot be fully utilized in the detection process. Moreover, the elevated computational complexity of these methods renders them unsuitable for processing huge volumes of data.
With the development of machine learning techniques, their application within the field of astronomy has increased. Compared to traditional methods, machine learning algorithms exhibit a heightened ability to efficiently manage large datasets and distinguish valuable features within them. Their swift adaptability to new observation data and robust generalization capabilities render them especially fitting for large volumes of astronomical data. In particular, in 2015, researchers introduced a deep neural network model designed to classify galaxy morphology. This model leveraged the translational and rotational symmetries inherent in galaxy images to achieve classification and redshift estimation (Dieleman et al. 2015; Jacobs et al. 2017). Within the realm of galaxy evolution studies, Schawinski et al. (2017) have proposed deep generative models, notably generative adversarial networks (GANs) (Goodfellow et al. 2014; Jacobs et al. 2017), and devised an innovative deconvolution technique to recover features from SDSS galaxy images.
In the realm of strong lensing system detection, Petrillo et al. (2017) have analyzed Kilo Degree Survey (KiDS) data using Convolutional Neural Networks (CNN). Their efforts yielded the successful identification of several strong lensing system candidates within the KiDS dataset. Metcalf et al. (2019) have implemented a variety of methods, including visual inspection, arc and ring finders, support vector machines (SVM), and convolutional neural networks (CNN), for the detection of strong lensing systems. These methods provided approaches for finding and classifying strong lensing systems in Euclid data. Jia et al. (2022) have introduced a transformer-based deep neural network (DETR) for strong lensing system detection, which demonstrated noteworthy proficiency in identifying strong lensing systems on the scale of galaxy clusters. The transformer model is designed to concentrate its attention on key regions, such as distorted images or arcs, during the detection process. Through its adaptive allocation of attention across different components, the algorithm effectively captures both local and global information, thereby enhancing the efficiency of the detection process.
Building upon the methodology introduced in Jia et al. (2022), we have implemented several enhancements. These incorporate a visual transformer employing a sliding-window strategy (Liu et al. 2021). This mechanism divides the image into an array of overlapping sub-windows, treating each sub-window as an input for processing, which addresses the impact of input-length restrictions and of the imbalance between positive and negative samples on the results. By operating at the window level, the model gains the ability to effectively capture features of small and densely clustered objects, thereby augmenting its capability to detect targets. With these adjustments, our approach is now capable of handling input data of varying lengths, and it results in more equitable feature extraction for a more comprehensive representation of strong lensing phenomena.
Furthermore, given the currently limited number of detected strong lensing instances and the various forms of noise that affect observation images, we have developed an end-to-end pipeline based on the previously discussed detection strategy. This pipeline encompasses three distinct components: image simulation for training data generation, image pre-processing, and target detection. The image simulation component serves to generate a substantial volume of simulated images. These simulated images incorporate prior knowledge about strong lensing systems, which is then utilized to train the detection algorithm. The subsequent image pre-processing step is employed to process real observation images. This involves deconvolving observation images with predefined PSFs of the telescope and adjusting the grayscales of these images. The image pre-processing component enhances the visibility of celestial objects with low signal-to-noise ratios. For the detection component, the simulated images are used as the training dataset. After training, real observation images known to contain strong lensing systems are processed using the image pre-processing algorithm, and its outcomes are employed to fine-tune the previously trained detection algorithm. Finally, all real observation images undergo pre-processing before being passed to the detection algorithm. This sequential process facilitates the identification of strong lensing systems in the observation data. This paper is structured as follows. In Section 2, we elaborate on the methodology employed to generate simulated images featuring strong lensing systems. We generate simulated images containing strong lensing systems at both the galaxy scale and the galaxy-cluster scale, so our method can in principle detect systems at both scales; the primary focus of this article, however, is on detecting strong lensing systems at the galaxy scale. In Section 3, we outline the procedures involved in data pre-processing. In Section 4, we introduce the detection algorithm for strong lensing systems, leveraging the attention mechanism. We subsequently evaluate the efficiency of our model using simulated data for the CSST. In Section 5, we establish a comprehensive pipeline for detecting strong lensing systems and assess its performance using public media images from the Euclid Early Release Observations and images from the Legacy Imaging Surveys. Finally, in Section 6, we present our conclusions and anticipate potential avenues for future research.

2. THE METHOD TO GENERATE SIMULATED DATA
The China Space Station Telescope (CSST) is a 2-meter space telescope slated to launch in approximately 2025, operating within the same orbit as the China Manned Space Station (Zhan 2021). As a significant scientific project within the Space Application System of the China Manned Space Program, the CSST is designed to have a large field of view, covering around 1.1 square degrees, with a high spatial resolution of roughly 0.15 arcseconds at 633 nm for its Survey Camera (SC). This versatile telescope spans multiple wavelength bands, from the near-ultraviolet (NUV) to the near-infrared (NIR), and it will simultaneously conduct photometric and slitless spectral surveys across an extended sky area of 17,500 square degrees.
Generating simulated data, as discussed in our previous papers (Jia et al. 2022, 2023a), serves a crucial purpose in the detection of celestial objects, particularly when sufficient training data is lacking. Simulated data provide valuable prior information about the targets we aim to detect. These simulations are important in training neural networks, enabling them to swiftly adapt to real observation data through transfer learning. Additionally, we leverage simulated CSST data to make predictions regarding the scientific outcomes of the CSST in the realm of strong lensing detection. The simulation code consists of two key components: the strong lensing simulator and the imaging simulator tailored for the CSST, both of which are briefly introduced below.
XU LI ET AL.
We use the PICS strong lensing system simulator without introducing noise or other instrumental effects (Li et al. 2016). Our simulation process, akin to Madireddy et al. (2019), involves five steps: (1) generating lens and source populations based on statistical properties, (2) constructing mass and light models for foreground lenses, (3) calculating lens deflection fields, (4) creating light profiles for background source galaxies, and (5) conducting ray-tracing simulations to generate strong lensing images using the deflection fields and source light profiles.
The populations of lenses and sources are built upon an advanced extragalactic catalog named CosmoDC2 (Korytov et al. 2019), a synthetic galaxy catalog constructed to support precision science with the Legacy Survey of Space and Time (LSST). It covers a sky area of 440 deg² out to a redshift of z = 3 and is complete to a magnitude depth of 28 in the r-band. Various properties, including stellar mass, morphology, spectral energy distributions, broadband filter magnitudes, host halo information, and weak lensing shear, characterize each galaxy in the catalog. CosmoDC2 has undergone a wide range of observation-based validation tests to ensure consistency with real observations. The official release of the CosmoDC2 dataset and its documentation are publicly available.
Lenses are modeled as a combination of dark matter halos and galaxies. The dark matter halo follows the Navarro, Frenk & White (NFW) profile, given by Navarro et al. (1996), as shown below:

ρ(r) = ρ_c / [(r/r_s)(1 + r/r_s)²],

where ρ_c is a characteristic density and r_s a scale radius. Its lensing potential is given by Golse & Kneib (2002), with κ_s = ρ_c r_s Σ_crit⁻¹, θ_s = r_s / D_a(z_l), and x = θ/θ_s, where D_a(z_l) stands for the angular diameter distance from the observer to the lens plane and the critical surface density is

Σ_crit = (c² / 4πG) · D_a(z_s) / [D_a(z_l) D_a(z_l, z_s)].

To achieve an elliptical lensing potential φ_ε(x) ≡ φ(x_ε), Golse & Kneib (2002) defined ε = (1 − q²)/(1 + q²) to create the elliptical coordinate system

x_1ε = √(1 − ε) x_1,  x_2ε = √(1 + ε) x_2,  x_ε = √(x_1ε² + x_2ε²),

and then substituted x by x_ε.
Then the deflection angles can be calculated numerically according to the relation

α(x) = ∇_x φ_ε(x).

Therefore, the deflection field of a dark matter halo is described by the parameter set {x_nfw,1, x_nfw,2, M_vir, c_vir, q_nfw, φ_nfw, z_l, z_s}, where (x_nfw,1, x_nfw,2) is the angular position of the dark matter halo in the field of view, M_vir is the virial mass, c_vir is the concentration, and q_nfw and φ_nfw are the axis ratio and position angle. z_l and z_s represent the redshifts of the lens and source planes, respectively.
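The numerical evaluation of deflection angles from a potential grid, α = ∇φ, can be sketched as below. For illustration we use a circular isothermal potential ψ(x) = θ_E |x| (our choice, not the pseudo-elliptical NFW potential of the paper), because its deflection has the known constant magnitude θ_E and so the numerical gradient can be checked directly:

```python
import numpy as np

def deflection_from_potential(psi, pixel_scale):
    """Evaluate alpha = grad(psi) numerically with finite differences."""
    # np.gradient returns derivatives along axis 0 (x2) and axis 1 (x1)
    a2, a1 = np.gradient(psi, pixel_scale)
    return a1, a2

# Demo: isothermal potential psi = theta_E * |x|, deflection magnitude theta_E
theta_E = 1.2
scale = 0.01
grid = np.arange(-2.0, 2.0, scale)
x1, x2 = np.meshgrid(grid, grid)
r = np.hypot(x1, x2)
psi = theta_E * r

a1, a2 = deflection_from_potential(psi, scale)
amp = np.hypot(a1, a2)

# Away from the central singularity the gradient recovers theta_E
mask = r > 0.5
```

The same routine applies unchanged to any tabulated potential, such as the pseudo-elliptical NFW potential above.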
The density profiles of galaxies are modeled as singular isothermal ellipsoids (SIE), whose deflection angles are given by Kormann et al. (1994) and Keeton (2001):

α_1 = (θ_E √q / √(1 − q²)) arctan(√(1 − q²) x_1 / ψ),
α_2 = (θ_E √q / √(1 − q²)) arctanh(√(1 − q²) x_2 / ψ),

with ψ = √(q² x_1² + x_2²), where q is the minor-to-major axis ratio and θ_E is an effective factor representing the Einstein radius,

θ_E = 4π (σ_v / c)² D_a(z_l, z_s) / D_a(z_s).

Therefore, the complete parameter set required by the equations above is {x_1, x_2, σ_v, q_gal, φ_gal, z_l}, where (x_1, x_2) is the angular position of the galaxy center in the field of view, σ_v is the velocity dispersion of the galaxy, q_gal is the axis ratio, φ_gal is the position angle, and z_l is the redshift of the lens plane. The parameters x_1, x_2, q_gal, φ_gal, and z_l are taken directly from the CosmoDC2 catalog. σ_v is derived from the L−σ scaling relation of the bright sample of Parker et al. (2007), with log_10(L/L⋆) = −0.4(mag_r − mag_r⋆), where mag_r is the apparent r-band magnitude of the galaxy given by the CosmoDC2 catalog. We adopt an evolving mag_r⋆ = 1.5(z − 0.1) − 20.44 with redshift, following More et al. (2016) and Faber et al. (2007).
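As a quick sanity check of the Einstein-radius scaling, the sketch below evaluates θ_E = 4π (σ_v/c)² D_ls/D_s with a simple flat-ΛCDM comoving-distance integral. The cosmology (H0 = 70 km/s/Mpc, Ω_m = 0.3) and the use of a flat universe are illustrative assumptions, not values taken from the paper:

```python
import math

C_KMS = 299792.458  # speed of light in km/s

def comoving_distance(z, H0=70.0, Om=0.3, n=2048):
    """Flat-LCDM comoving distance in Mpc via trapezoidal integration."""
    dz = z / n
    E = [math.sqrt(Om * (1 + i * dz) ** 3 + (1 - Om)) for i in range(n + 1)]
    integral = sum((1 / E[i] + 1 / E[i + 1]) / 2 * dz for i in range(n))
    return C_KMS / H0 * integral

def einstein_radius_arcsec(sigma_v, z_l, z_s):
    """theta_E = 4 pi (sigma_v/c)^2 D_ls/D_s for an SIS/SIE lens."""
    ratio = 1.0 - comoving_distance(z_l) / comoving_distance(z_s)  # flat universe
    theta_rad = 4 * math.pi * (sigma_v / C_KMS) ** 2 * ratio
    return math.degrees(theta_rad) * 3600.0

# A sigma_v = 250 km/s lens at z_l = 0.5 with a source at z_s = 2.0
theta_E = einstein_radius_arcsec(250.0, z_l=0.5, z_s=2.0)  # roughly 1.1 arcsec
```

Such a check is also the basis of the reference Einstein radii θ_E,ref used below to select lenses.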
For a given lens, summing the gravitational lensing contributions of all its components, we obtain the total deflection angle α_tot and the lens equation y = x − α_tot(x). Hence, strongly lensed arcs can be generated by tracing light rays from the lens plane to the source plane, where the source images are drawn according to their positions and light profiles. To ensure prominent lensing features, we require z_s > z_l + 0.5 for source galaxies and place one source manually by randomly choosing source positions in regions of the source plane where the lensing magnification exceeds 20. Galaxies in the light cone, whether lensed or not, are modeled with composite Sersic profiles based on CosmoDC2 parameters. Lensed arcs are rendered with Sersic profiles on ray-traced grids.
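A minimal ray-shooting sketch of the lens equation y = x − α(x): we assume a singular isothermal sphere deflector and a circular n = 1 Sersic source centred on the optical axis (both simplifications of the composite models used in the paper), so the rendered image is an Einstein ring:

```python
import numpy as np

def sis_deflection(x1, x2, theta_E=1.2):
    """Deflection of a singular isothermal sphere: alpha = theta_E * x/|x|."""
    r = np.hypot(x1, x2)
    r = np.where(r == 0, 1e-8, r)  # guard the central singularity
    return theta_E * x1 / r, theta_E * x2 / r

def sersic(y1, y2, r_eff=0.3, n=1.0, I0=1.0):
    """Circular Sersic profile on the source plane (b_n approximation)."""
    b_n = 1.9992 * n - 0.3271
    r = np.hypot(y1, y2)
    return I0 * np.exp(-b_n * ((r / r_eff) ** (1.0 / n) - 1.0))

# Image-plane grid; map each ray to the source plane and sample the source
grid = np.linspace(-2.5, 2.5, 500)
x1, x2 = np.meshgrid(grid, grid)
a1, a2 = sis_deflection(x1, x2)
arc = sersic(x1 - a1, x2 - a2)  # brightest along the ring r = theta_E
```

The paper's simulator applies the same mapping with the summed SIE-plus-NFW deflection field and CosmoDC2 source parameters.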
With the above strategy for generating strong lensing system images, we utilize the CSST simulator, an end-to-end pipeline integrating cosmology simulation, gravitational ray-tracing, optical instruments, and imaging, to generate mock observation images. In this simulator, galaxies consist of bulges and disks, which are modeled with composite Sersic profiles based on CosmoDC2 parameters and convolved with simulated PSFs for the CSST. These galaxies are then imaged onto detectors with dimensions of 9216 × 9232 pixels and a pixel scale of 0.074 arcseconds. We then choose strongly lensed arcs in three steps: (1) calculate the reference Einstein radii θ_E,ref of all galaxies and galaxy clusters in the above field of view by setting z_s = 10.0; (2) remove the deflectors with θ_E,ref < 0.2″; (3) randomly choose 1% of the remaining deflectors as lenses. We then generate strongly lensed arcs for the selected lenses as described in the above paragraphs, and the pixelized images of the strongly lensed arcs are passed into the CSST image simulator afterwards. Subsequently, both galaxies and strong lensing systems are imaged together onto the detectors.
Rather than relying on parametric models, the PSF is derived from an optical design model. To produce a comprehensive set of realistic PSFs that account for the impact of the optical system, an optical emulator has been created to simulate high-fidelity PSFs for the CSST. This emulator comprises six distinct modules, each simulating optical aberrations stemming from mirror surface roughness, fabrication errors, CCD assembly errors, gravitational distortions, and thermal distortions. Additionally, the simulated PSF incorporates two dynamic errors arising from micro-vibrations and image stabilization. We have further introduced various sources of noise into the simulated CSST images, encompassing shot noise, the sky background, and detector effects. To accomplish this, we have employed Galsim (Rowe et al. 2015) to simulate photon generation from a given galaxy, accounting for the throughputs of the CSST system, including mirror efficiency, filter transmission, and detector quantum efficiency. We have also introduced Poisson noise originating from both the sky background and the dark current of the CCD detector. Specifically, we have set the i-band background level to 0.212 e−/pixel/s, with a dark current of 0.02 e−/pixel/s, resulting in an average of approximately 35 e−/pixel in a 150 s exposure. Furthermore, we have incorporated read noise using a Gaussian distribution with a standard deviation of approximately 5.0 e−/pixel. In simulating the creation of mock galaxy images on the detector, we have also accounted for the bias and gain factor of the detector.

3. DATA PRE-PROCESSING

Within this section, we establish the data pre-processing method for images captured by the CSST. It is worth noting that these steps can be adjusted, or even omitted, when processing observation data obtained by other telescopes. Moreover, should the need arise, parameters within these processes can readily be fine-tuned to adapt to images obtained by other surveys.

The Image Cropping Step
Given that images acquired by the CSST are relatively large, with a size of 9232 × 9216 pixels, we find it necessary to crop them into smaller sections, each of size 1000 × 1000 pixels. This step is essential to align with hardware constraints, as the detection algorithm needs approximately 14.88 GB of GPU memory for an image of dimensions 1000 × 1000 pixels. Given that this study uses an RTX 3090 Ti GPU with a maximum memory capacity of 24 GB, this image size represents the upper limit of what the GPU can accommodate. Moreover, to ensure that images of strong lensing systems remain intact and undivided, we have introduced overlapping segments, with each segment featuring an overlap region of 80 pixels. This approach results in the generation of 100 smaller images following the cropping of a single original larger image. It is worth noting that the image cropping step can be circumvented if a GPU with greater memory capacity is available.
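The cropping scheme can be sketched as follows; with a 1000-pixel tile and an 80-pixel overlap the stride is 920 pixels, which yields a 10 × 10 grid of 100 cut-outs for a 9232 × 9216 frame. Clamping the last row and column to the image edge is our assumption about how partial tiles are handled:

```python
import numpy as np

def crop_with_overlap(image, tile=1000, overlap=80):
    """Crop a large frame into tile x tile cut-outs; adjacent cut-outs
    overlap by `overlap` pixels, and the last row/column of cut-outs is
    clamped to the image edge so every pixel is covered."""
    stride = tile - overlap
    h, w = image.shape[:2]
    tops = list(range(0, h - tile, stride)) + [h - tile]
    lefts = list(range(0, w - tile, stride)) + [w - tile]
    return [image[t:t + tile, l:l + tile] for t in tops for l in lefts]

frame = np.zeros((9232, 9216), dtype=np.float32)
tiles = crop_with_overlap(frame)  # 100 cut-outs of 1000 x 1000 pixels
```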

The Image Deconvolution Step
Astronomical imaging faces inherent constraints in angular resolution, primarily attributed to factors like atmospheric turbulence, diffraction, and instrumental effects. This limitation is rooted in the mathematical equivalence of Fraunhofer diffraction to a Fourier transform, indicating the absence of signal on Fourier scales smaller than the diffraction limit. Consequently, when the PSF is set by the diffraction-limited optics of the telescope, resulting in a zero signal-to-noise ratio on scales smaller than the PSF, the image information becomes incomplete. Additional factors, such as jitter, may exacerbate the performance degradation. Leveraging advancements in machine learning, deconvolution neural networks have been employed to denoise and enhance astronomical images across various wavelengths, with the goal of restoring diffraction-limited performance (Lauritsen et al. 2021). In this study, we assume that we can obtain PSFs of the CSST and that these can be used to generate simulated blurred images. These blurred images and the original images can then be used as the training set for deep-learning-based image restoration algorithms. We employ the PSF-NET method for image deconvolution, as presented by Jia et al. (2020). The PSF-NET contains two neural networks: a PSF neural network and a restoration neural network. The PSF neural network models the PSF, facilitating the conversion of high-resolution images into blurred counterparts. The restoration neural network, named RESTORE, encapsulates the deconvolution algorithm, converting blurred images back to high-resolution forms. Both of these neural networks include residual blocks, in addition to multiple convolutional and transposed convolutional blocks, as illustrated in Figure 1.
During the training phase, both the PSF neural network and the RESTORE neural network are trained. This approach enhances training efficiency while mitigating the risk of overfitting. After training, the RESTORE neural network can be used directly to restore blurred images. In accordance with the functionalities of these two networks, we adjust the loss function of these neural networks as depicted in Equation 9, following the method proposed in Lv et al. (2022) and Jia et al. (2024):

L = L_idet + L_rec + L_ffl.

Here, L_idet signifies the identity loss function, L_rec denotes the cyclic loss function, and L_ffl stands for the focal frequency loss function.
In the above equation, both L_idet and L_rec are used to minimize the mean square error between the original image and the image restored by the neural network. The PSF neural network and the RESTORE neural network share an encoder-decoder structure. However, the decoder section encompasses several convolutional and upsampling layers. These layers introduce gaps in the restored image, leading to loss of information and the introduction of artifacts. Such issues can disrupt the accurate interpretation and analysis of the image. To enhance the efficacy of our neural network, we introduce a focal frequency loss, denoted as L_ffl (Jiang et al. 2021). This loss function improves neural network performance by incorporating regularization weights, W, derived from the density of the power spectrum. The focal frequency loss L_ffl is equivalent to the mean square error computed in the spatial frequency domain between the restored image and the original image, as shown in Equation 10:

L_ffl = (1/MN) Σ_{u,v} W(u,v) |FFT(I_restored)(u,v) − FFT(I_original)(u,v)|²,

where FFT denotes the Fast Fourier Transform, M and N are the image dimensions, and W(u,v) is the spectrum-difference weight raised to the regularization parameter α, which is assigned a value of 1.
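A simplified single-image sketch of the focal frequency loss is given below. The published loss of Jiang et al. (2021) operates on mini-batches and supports patch-wise spectra; the weight normalisation used here is our assumption:

```python
import numpy as np

def focal_frequency_loss(restored, original, alpha=1.0):
    """Weighted MSE between the 2-D FFTs of two images; the weight map
    emphasises the frequencies with the largest residuals (Jiang et al. 2021,
    simplified to a single image)."""
    diff = np.abs(np.fft.fft2(restored) - np.fft.fft2(original))
    w = diff ** alpha
    w = w / (w.max() + 1e-12)  # normalise the weight map to [0, 1]
    return float(np.mean(w * diff ** 2))

rng = np.random.default_rng(0)
img = rng.random((64, 64))
loss_same = focal_frequency_loss(img, img)                        # zero
loss_diff = focal_frequency_loss(img, img + 0.1 * rng.random((64, 64)))
```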
Adhering to the image restoration principles discussed above, we have performed restoration on simulated data containing noise. Following this, we compare the restored image with the original noisy image after applying the asinh transformation. The restoration results are visualized in Figure 2. As shown in this figure, the quality of the images has been improved by our method. We use the peak signal-to-noise ratio (PSNR) defined in Wang et al. (2004) and Xu et al. (2014) to evaluate the quality of these images. In this case, the PSNR has been improved from 48.22 to 51.77 by our method. This highlights how our method can effectively improve the signal-to-noise ratio of faint celestial targets.
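The PSNR metric quoted above can be computed as follows; using the reference image's maximum as the peak value is our assumption, with the definition otherwise following Wang et al. (2004):

```python
import numpy as np

def psnr(reference, test, peak=None):
    """Peak signal-to-noise ratio in dB between a reference and a test image."""
    mse = np.mean((reference.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    if peak is None:
        peak = float(reference.max())
    return 10.0 * np.log10(peak ** 2 / mse)

# Example: a single corrupted pixel in a flat 10 x 10 frame gives MSE = 1
ref = np.full((10, 10), 255.0)
noisy = ref.copy()
noisy[0, 0] = 245.0
value = psnr(ref, noisy)  # 10 * log10(255^2) = 48.13 dB
```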

The Grayscale Transformation Step
In contrast to typical images, astronomical images frequently encompass celestial objects spanning significantly wider grayscale ranges. Consequently, certain objects with a low signal-to-noise ratio may appear similar to the background when compared against the overall grayscale. While deep neural networks are capable of learning non-linear mapping functions, it would be an inefficient use of computational resources if a straightforward transformation function suffices. A grayscale transformation function is one such example. The asinh function, introduced by Lupton et al. (2004), serves to enhance the signal-to-noise ratio of faint targets. This function is defined by Equation 11:

asinh(x) = ln(x + √(x² + 1)),

where x is a real number. When the value of x is relatively large, asinh(x) can be approximated as ln(2x); when the value of x is relatively small, asinh(x) can be approximated as x.
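A hedged sketch of such a grayscale transformation is given below; the min-max normalisation and the `softening` fraction that sets the scaling factor are our assumptions, since the exact form of the scaling depends on the calibration discussed below:

```python
import numpy as np

def asinh_stretch(data, softening=0.02):
    """Normalise an image by its min/max and apply an asinh stretch.
    `softening` (an assumed parameter) sets the fraction of the data range
    at which the transform crosses from linear to logarithmic behaviour."""
    min_value = float(data.min())
    max_value = float(data.max())
    scale_factor = (max_value - min_value) * softening + 1e-12
    return np.arcsinh((data - min_value) / scale_factor)

# Faint values are preserved almost linearly, bright ones are compressed
out = asinh_stretch(np.array([0.0, 1.0, 100.0]))
```

A 100:1 dynamic range is compressed to roughly 10:1, which makes faint targets visible alongside bright galaxies.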
When handling these images, our initial step involves computing the maximum and minimum values within the provided image to establish the data range. This range is then used to derive the scaling factor for the data. It is important to highlight that, when analyzing images processed by our method, where the flux has been calibrated across different bands, we can use the maximum and minimum values directly. However, when dealing with real observation data, it is crucial to choose suitable values based on the photometric zero point and the full well depth of the observed images. The equation used to calculate this scaling factor in this paper is given in Equation 12, where max_value and min_value respectively represent the maximum and minimum values of the image. After calculating the scaling factor, the data can be subjected to an asinh transformation, defined in Equation 13, where data_value corresponds to the original data value, asinh_value signifies the value after the asinh transformation, and min_value and scale_factor respectively denote the minimum value and the scaling factor determined above. As shown in Figure 2(c), after the grayscale transformation, celestial objects with a low signal-to-noise ratio become distinctly visible.

4. THE DETECTION ALGORITHM

The algorithm proposed in this paper is built upon the foundation of the transformer model. The transformer, originally applied extensively in the realm of Natural Language Processing (NLP), has yielded favorable outcomes in numerous NLP tasks (Vaswani et al. 2017). Building on these successes, the transformer architecture has been extended to tasks involving image processing. However, in contemporary transformer-based neural networks, tokens are confined to fixed sizes (Vaswani et al. 2017; Dosovitskiy et al. 2020; Liu et al. 2021), thus restricting the input to a fixed number of dimensions. This presents a challenge for images, which inherently possess thousands of dimensions and require precise predictions at the pixel level. As a consequence, the transformer faces difficulties in effectively handling high-resolution images, largely due to the quadratic increase in computational complexity with respect to image size.
To surmount these challenges, we enhance the transformer architecture by incorporating the Swin Transformer model as the core feature extraction module (Liu et al. 2021; Jia et al. 2023b). The Swin Transformer has two noteworthy advantages. Firstly, its hierarchical structure aligns seamlessly with the feature pyramid network (Lin et al. 2017; Liu et al. 2021) or the U-Net (Ronneberger et al. 2015; Liu et al. 2021), enhancing its compatibility with such frameworks. Secondly, it introduces the concept of shifted windows, where self-attention is computed within local windows rather than globally (Liu et al. 2021). This architectural choice significantly mitigates model complexity, reducing the quadratic complexity O(N²) in the number of tokens to linear complexity. These attributes collectively render the Swin Transformer an adaptable feature extraction model, well-suited for a range of visual tasks. The architecture of the model is shown in Figure 3. Here is an overview of its structure: The initial step involves feeding preprocessed images into the Swin Transformer, which serves as the feature extractor. Subsequently, the outputs from the Swin Transformer are directed into the RPN network for binary classification and position regression. Regions potentially containing strong lensing systems are singled out, and using the ROIAlign layer, the feature map is cropped and aligned to generate fixed-size feature tensors. These tensors are then independently transmitted to the classification branch, the bounding box regression branch, and the mask segmentation branch, each responsible for a different aspect of the object detection task. By amalgamating the outputs from these three branches, the network can simultaneously provide comprehensive information about the position and mask of each detected strong lensing system.
Figure 4 shows the precise architecture and data flow of the Swin Transformer. An image of size H × W × C is divided into 4 × 4 patches, effectively transitioning from pixels to patches as the smallest unit of representation. This procedure generates a feature map of size H/4 × W/4 × (4 × 4 × C). This feature map then passes through four stages. In the first stage, a linear embedding projects the feature map from H/4 × W/4 × (4 × 4 × C) to the embedding dimension. In the subsequent three stages, patch merging is executed, performing operations after the amalgamation of neighboring patches. For classification networks, the final output is obtained by appending a LayerNorm layer, a global pooling layer, and a fully connected layer.
In each stage, the initial input undergoes downsampling via a Patch Merging layer. This layer groups neighboring pixels into patches, which are combined to form four feature maps. These four feature maps are then concatenated in the depth dimension. Following this, a LayerNorm and a fully connected layer are employed to linearly adjust the depth of the feature map, halving the concatenated depth from 4C to 2C. After traversing the Patch Merging layer, the height and width of the feature map are halved, while its depth is doubled. To manage high-resolution images with many tokens more effectively, the Swin Transformer implements a technique known as local window self-attention. This approach reduces computational complexity while maintaining commendable outcomes. In essence, the model concentrates on local regions at a given moment, rather than attempting to process the entire image concurrently. This method facilitates a step-by-step image processing approach, making it feasible to handle large images without overwhelming computational resources. To ensure coherence among windows and boost efficiency, the Swin Transformer adopts a strategy called "shifted window partitioning". This strategy mitigates computational load while ensuring interconnections among the windows, as shown in Figure 5.
As depicted in Figure 5, Layer l contains only 4 windows, while Layer l+1 encompasses 9 windows. Multi-head self-attention (MSA) is executed within each of these windows. To alleviate the computational burden, an efficient batch computation approach is employed for the shifted configuration, as shown in Figure 6. The method fills relatively small windows up to dimensions of M × M through a cyclic shifting operation, while masking the filled areas in the attention computation. In Figure 6, after cyclically shifting toward the upper left, a batched window can encompass several sub-windows that are not adjacent in the feature map; the number of batched windows matches the regular window partition count. During computation, self-attention is applied solely within each sub-window, with the remaining entries masked out; afterwards, the feature map is shifted back to its original arrangement. In simpler terms, this method enlarges windows by shifting them and masks out specific values in the process, yielding batched windows composed of disconnected sub-windows; self-attention is carried out individually within each sub-window, and once the calculations conclude, the original arrangement is restored. Because the Swin Transformer has a large number of parameters, training from random weights would consume substantial computational resources; we therefore initialize the network with the weights provided by K-H-Ismai4. As described in Section 2, the simulated data serve as the training and validation datasets for the neural network. Our simulation assumes a galaxy density of 55.6 galaxies per arcmin² detectable by the CSST, resulting in an expected 1.4 strong lensing systems per arcmin²; this galaxy density aligns with the Stage IV galaxy surveys. The training dataset consists of 3,000 images, each with dimensions of 1000 × 1000 pixels, and the training and validation sets are randomly split in a 7:3 ratio. It is worth noting that if
the model is intended for application with data from other sky survey projects, utilizing images generated by specific simulation codes or real observation data for fine-tuning would be necessary, given the specialized nature of the CSST simulation code.
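The cyclic-shift trick described above can be sketched as follows (a hypothetical NumPy illustration; the real implementation also builds attention masks so that wrapped-in sub-windows do not attend to each other):

```python
import numpy as np

def cyclic_shift(feat, shift):
    # Shift the feature map toward the upper-left; pixels that fall off
    # wrap around, so every window remains a full M x M block and can be
    # processed in one batch. Attention masks (not shown) prevent the
    # wrapped-in pixels from attending to unrelated regions.
    return np.roll(feat, shift=(-shift, -shift), axis=(0, 1))

def reverse_shift(feat, shift):
    # Restore the original arrangement after windowed self-attention.
    return np.roll(feat, shift=(shift, shift), axis=(0, 1))

feat = np.arange(16).reshape(4, 4)
shifted = cyclic_shift(feat, 1)
restored = reverse_shift(shifted, 1)
assert (restored == feat).all()
```

The shift/reverse pair is exact, so the only extra cost over regular window attention is the masking, which keeps the complexity linear in the number of tokens.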

Training of the Strong Lensing System Detection Algorithm
We generate labels for these images using the following approach. Images of strong lensing systems are enclosed with bounding boxes, which we define by center coordinates and a fixed size. Since strong lensing systems exhibit varying sizes, we opt for bounding boxes with a fixed dimension of 50 × 50 pixels. This strategy enables precise object localization, capturing a broad range of objects while minimizing the inclusion of extraneous background details, which enhances the efficacy of training. During the training stage, the learning rate of the Swin Transformer is set to 0.001. As we train the neural network, a consistent decrease in loss is observed. After 100 epochs of training, which take around 72 hours on a computer with one RTX 3090 Ti GPU, the loss stops decreasing, indicating that the model has converged. The learning curve is depicted in Figure 7.
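As an illustration of the label-generation step, fixed-size boxes can be built from lens center coordinates as follows (a hypothetical helper; the exact label format depends on the detection framework):

```python
import numpy as np

def make_boxes(centers, box_size=50, img_size=1000):
    """Build fixed-size bounding-box labels (x_min, y_min, x_max, y_max)
    from lens center coordinates, clipped to the image frame."""
    half = box_size // 2
    boxes = []
    for cx, cy in centers:
        x0 = max(0, cx - half)
        y0 = max(0, cy - half)
        x1 = min(img_size, cx + half)
        y1 = min(img_size, cy + half)
        boxes.append((x0, y0, x1, y1))
    return np.array(boxes)

print(make_boxes([(30, 500)]))  # box clipped at the left image edge
```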

PERFORMANCE EVALUATION OF THE PIPELINE
In this section, we assess the performance of the pipeline using simulated CSST data and explore the potential scientific insights the CSST could provide for studying strong lensing. Additionally, we demonstrate the necessity of the various stages within our pipeline.

Evaluation Metrics of the Detection Results
In object detection tasks, the accuracy of position prediction is often evaluated using the Intersection over Union (IOU) ratio, which quantifies the degree of overlap between the predicted bounding box and the ground truth box. Since both the position and the size of a prediction matter for object detection, we calculate the IOU between our prediction and the ground truth box; a higher IOU indicates a more precise prediction. The IOU is defined in Equation 14 as IOU = Intersection / Union, where Intersection is the overlap region between the predicted and ground truth bounding boxes, and Union is the entire area covered by both boxes. In simpler terms, Intersection corresponds to the shared area and Union to the total area covered by both boxes. When evaluating detection results with the IOU metric, we need to select a specific threshold. If the IOU between the predicted box and the ground truth box exceeds this threshold, the predicted box is considered to have successfully detected the object and is counted as a true positive (TP). Otherwise, the detection is considered unsuccessful: a missed true object counts as a false negative (FN), while a non-object region mistakenly identified as an object counts as a false positive (FP). This threshold-based approach lets us evaluate the ability of the model to distinguish true positives, false negatives, and false positives.
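The IOU computation described above can be sketched as:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x0, y0, x1, y1)."""
    ix0 = max(box_a[0], box_b[0])
    iy0 = max(box_a[1], box_b[1])
    ix1 = min(box_a[2], box_b[2])
    iy1 = min(box_a[3], box_b[3])
    # clamp to zero when the boxes do not overlap
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Two 50x50 boxes offset by 25 pixels in each direction:
print(iou((0, 0, 50, 50), (25, 25, 75, 75)))  # 1/7 ≈ 0.143
```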
To evaluate the performance of our model, we compute two crucial metrics: precision and recall. Precision is the fraction of detections that are truly positive, Precision = TP / (TP + FP) (Equation 15). The recall rate, often referred to as the "detection rate", is the fraction of actual positive boxes that are correctly recognized, Recall = TP / (TP + FN) (Equation 16). Precision and recall are interconnected metrics with a reciprocal relationship. As we raise the detection threshold, the model becomes more conservative: it is more precise in recognizing true positives, yet it can overlook some positive instances, so recall declines. Conversely, lowering the threshold makes the model more sensitive to positive instances, increasing recall; however, this improvement comes with a trade-off, as precision may decrease and the model may produce more false positive predictions.
For an accurate model evaluation, achieving a balance between precision and recall is essential, and this balance can be attained by determining an optimal threshold. In this way, we obtain a model that correctly identifies positive samples while minimizing the occurrence of false positives. To balance these quantities, we use the Precision-Recall (PR) curve, which illustrates the interplay between the two metrics: precision indicates the correctness of positive predictions, while recall gauges the ability of the model to detect all actual positive instances. On the PR curve, we plot recall on the x-axis and precision on the y-axis. By examining this curve, we can make informed decisions about the performance of the model and select the threshold that aligns with our specific requirements. Typically, the closer the PR curve lies to the upper-right corner, the better the performance. To quantitatively measure the trade-off between precision and recall for an individual class, we compute the Average Precision (AP), the area under the Precision-Recall curve. The mean Average Precision (mAP) is the average AP value across all classes, offering an assessment of overall model performance. In this paper, we employ PR curves and mAP to compare the performance of different trained models, aiding our comprehensive evaluation of their efficacy.
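The AP computation (area under the precision-recall curve) can be sketched as follows, assuming detections have already been matched to ground-truth boxes at a chosen IOU threshold; the step-wise integration rule shown here is one common convention among several used in detection benchmarks:

```python
import numpy as np

def average_precision(scores, is_tp, n_gt):
    """AP for a single class: sort detections by confidence, sweep the
    threshold, and integrate precision over recall. `is_tp` marks
    detections that match a ground-truth box at the chosen IOU
    threshold; `n_gt` is the number of ground-truth objects."""
    order = np.argsort(scores)[::-1]
    tp = np.cumsum(np.asarray(is_tp)[order])
    fp = np.cumsum(~np.asarray(is_tp)[order])
    recall = tp / n_gt
    precision = tp / (tp + fp)
    # step-wise integration of precision over recall increments
    return float(np.sum(np.diff(np.concatenate(([0.0], recall))) * precision))

scores = [0.9, 0.8, 0.7, 0.6]
is_tp = np.array([True, False, True, True])
print(average_precision(scores, is_tp, n_gt=3))  # ≈ 0.806
```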

Performance Evaluation of the Detection Algorithm
In this subsection, we show the performance of the detection algorithm on the CSST simulation data. Since no data pre-processing steps are used here, we treat these detection results as the baseline. Two sets of simulated data are used: CSST simulation data with no PSF variations, readout noise, or background noise are considered clear data, and CSST simulation data with realistic PSF variations and different noises are considered noisy data. Two neural networks are trained, one on the clear data and one on the noisy data, through the same procedure. The results are shown in Figure 8.
We designate RT3 to denote clear data and RT4 noisy data. As illustrated in Figure 8(a) and Table 1, setting IOU to 0 yields a precision of approximately 30%, while the recall is 90%. This indicates that the model can effectively detect nearly all strong lensing systems, albeit with only 30% of these detections being genuine strong lensing systems. However, when IOU is set to 0.5, as shown in the RT3 column in Table 2, the recall drops noticeably to 70%, suggesting the model could identify roughly 70% of all strong lensing systems. As shown in these figures, noise and blur strongly degrade the detection results, so algorithms are needed to mitigate these effects; we can also see that a smaller IOU threshold yields a larger AP, i.e., higher recall and precision at the cost of lower position-regression accuracy.
In Figure 8(b) and Table 1, the influence of noise becomes apparent, resulting in a detection accuracy of only 69.6% when IOU is set to 0. With a recall rate of 80%, the precision drops to merely 53.3%, indicating that only about half of the detected results are strong lensing systems, while the majority are false positives. This low precision would necessitate significant human intervention in practical applications. Moreover, with increasing IOU values, as exemplified by the data in the RT4 column of Table 2, the accuracy sharply decreases, reaching 25.5% when IOU is set to 0.5. Based on the outcomes presented above, it is apparent that while our methods can yield useful results, certain aspects still need improvement. When employing a low IOU threshold, many detection results exhibit relatively low precision, demanding extensive subsequent human intervention. Consequently, there is a need to further refine our approach and develop complementary methods that enhance the efficiency of detection.

Performance Evaluation of the Detection Pipeline
As discussed earlier, in order to obtain scientific outcomes at acceptable cost from observation data collected by the CSST, there is an urgent need to enhance the detection capability of our algorithm. Given the impact of varying noise levels and variable PSFs on images, one strategy is to leverage image restoration algorithms such as the PSF-Net proposed by Lv et al. (2022) and Jia et al. (2024): we train the PSF-Net with CSST PSFs and subsequently employ it to enhance the quality of observation images. Considering that many strong lensing systems exhibit relatively low signal-to-noise ratios, we also advocate the gray scale transformation algorithm outlined in Lupton et al. (2004), which renders dim celestial objects easier to detect. Our proposed strategy involves training the Swin Transformer-based detection algorithm with a mix of CSST simulation data containing both noisy and clear images. We then deploy this trained network to process images at various stages within the data processing pipeline, thereby clarifying the necessity of a robust detection pipeline.
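A gray scale transformation in the spirit of Lupton et al. (2004) can be sketched as follows; the `softening` and `scale` parameters are illustrative assumptions, not the values used in our pipeline:

```python
import numpy as np

def asinh_stretch(img, softening=0.02, scale=1.0):
    """Arcsinh gray scale transformation: approximately linear for
    faint flux and logarithmic for bright flux, so dim lensed arcs
    remain visible next to bright lens galaxies. Output is normalized
    so that a flux of `scale` maps to 1."""
    return np.arcsinh(scale * img / softening) / np.arcsinh(scale / softening)

img = np.array([0.0, 0.01, 0.1, 1.0])
stretched = asinh_stretch(img)
print(np.round(stretched, 3))
```

The compressive behavior at high flux is what makes faint arcs near bright central galaxies easier for both humans and the detector to see.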
We test the performance of our algorithm using three sets of simulated images, and the results are depicted in Figure 9. In Figure 9(a), it is evident that clear images after gray scale transformation demonstrate outstanding performance, achieving a precision of 99% for recall values up to 80%. This highlights the capability of the model to identify nearly all strong lensing systems with exceptional accuracy. A slight decline is observed only at an IOU of 0.9, attributable to the stringent localization requirement, which leaves some strong lensing systems undetected. Compared to unprocessed clear images, the detection accuracy rises to 0.975 at an IOU of 0.5, a substantial 66% enhancement. This initial test indicates that gray scale transformation is a crucial step in data preprocessing. We then apply the asinh transformation to the noisy images and employ the trained Swin Transformer to detect strong lensing systems in them.
The detection results are presented in Figure 9(b). We use RT3 to denote clear data, RT4 noisy data, RT3+asinh and RT4+asinh data after the asinh gray scale transformation, and RT4+restore+asinh images after restoration and the asinh transformation. In both Table 1 and Table 2, when comparing RT3+asinh and RT4+asinh, the mAP of noisy images is smaller than that of clear images regardless of whether the IOU is 0 or 0.5. Moreover, as the recall rate ranges from 0.8 to 1.0, the precision decreases significantly, resulting in the loss of many strong lensing events. Therefore, we decided to perform image restoration on the noisy images.
Finally, we combine the image restoration and gray scale transformation algorithms to process noisy images and assess the performance of the full detection pipeline. The results serve as a reference for anticipating the potential scientific output of the CSST in the detection of strong lensing systems, as depicted in Figure 9(c). With an IOU of 0, the detection accuracy is 92.9%. Although the improvement in mAP is not dramatic, there is a noteworthy increase in the recall rate for IOU values between 0.2 and 1.0. This is crucial since the initial maximum recall rate is 90%, indicating that at least 10% of strong lensing systems would go undetected regardless of parameter adjustments. Following image restoration, however, the recall rate reaches 100%, significantly reducing the number of undetectable targets. Hence, the image restoration algorithm proves indispensable and justifies its integration into the detection pipeline, and the detection results demonstrate substantial improvement with the full pipeline. In practical scenarios, configuring the IOU threshold to 0.5 yields a recall rate of 0.6 and a precision rate of 0.949. Considering the vast number of galaxies and potential strong lensing systems observable with the CSST, we have the potential to identify a large pool of strong lensing candidates. Within this pool, however, thorough verification by human experts is necessary, posing a significant challenge. To address this, we are exploring collaboration with citizen science projects, inspired by the GalaxyZoo projects (Fortson et al. 2012), to assist in obtaining the final validated detection outcomes.

Visual Analysis of Detection Results
It is crucial to examine the outcomes of the detection process and assess potential issues and risks associated with deploying the detection algorithm. We have evaluated four distinct scenarios and discuss each of them below. 1. Impact of Strong Lensing Systems with Large Size: Strong lensing systems have a range of sizes. In the label creation step, we standardize the bounding boxes to dimensions of 50 × 50 pixels. This uniform size facilitates efficient detection by the neural network while mitigating the influence of background and noise, and it generally proves effective for most strong lensing systems. However, certain instances with larger spatial extents surpass these bounding boxes; this over-extension disrupts their structural features and causes information loss. Consequently, our neural network has difficulty accurately pinpointing these extended gravitational lensing systems. In Figure 10, the blue boxes represent the strong lensing systems present in the labels, while the green boxes indicate correctly detected strong lensing systems. The strong lensing system in the top-left corner is notably large; the 50 × 50 pixel bounding box misplaces the detection position at the center of the box for this extended object. This misalignment results in flawed feature extraction by the model, yielding imprecise detection results. In general, strong lensing systems larger than 50 × 50 pixels are hard to locate accurately.
2. Impact of Strong Lensing Systems with Small Size or Low Signal-to-Noise Ratio: Distinctive curved structures and multiple images are characteristic features of strong lensing systems. However, detection becomes particularly challenging when these systems are exceptionally small or exhibit low signal-to-noise ratios. As illustrated in the left panel of Figure 11(a), the lens structure in the bottom-right corner of the image is extremely small, making it prone to being identified as background or another object. In the right panel of Figure 11(b), the strong lensing system is heavily impacted by background noise, making it difficult for the model to recognize it accurately. Both small and low-signal-to-noise systems can be categorized as low signal-to-noise targets, with SNR defined as the ratio of the total flux within the strong lensing system to the variance of the background noise (SNR = TotalFlux_StrongLensing / Variance_Background). Typically, identifying strong lensing systems with a signal-to-noise ratio below 16 in CSST observation data is considerably challenging, with 0.36% of them remaining effectively undetected.
3. Impact of Bright Sources: Bright central galaxies and bright stars affect the detection efficiency. Light from bright central galaxies makes light from the strong lensing systems hard to recognize in the multi-color space defined by the filters of the CSST, so these targets may not be detected in observation images. As shown in Figure 12(a), there is a strong lensing system behind the galaxy that is hard for our algorithm to detect. Meanwhile, several bright stars with similar colors are likely to be detected as strong lensing systems, probably because of the similarity between them and genuine systems; in the right panel of Figure 12, the system marked in red is a false detection. Detecting strong lensing systems in spite of bright central galaxies is challenging even for human experts. Although our algorithm partly addresses this problem by exploiting images with more colors, gaps remain, and we need to further consider channel attention when designing novel algorithms to solve it. The false positives caused by bright stars, however, may not be a serious problem, since these targets can be masked before the detection of strong lensing systems begins.
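The signal-to-noise ratio defined in scenario 2 above can be computed as follows (a minimal sketch; the `mask` and `background` selections are illustrative):

```python
import numpy as np

def lens_snr(image, mask, background):
    """Signal-to-noise ratio as defined in the text: total flux inside
    the strong-lensing footprint divided by the variance of the
    background noise. `mask` selects the lensed pixels; `background`
    is an empty sky region."""
    total_flux = image[mask].sum()
    return total_flux / background.var()

# Toy example: four lensed pixels of flux 2 over sky with unit variance.
image = np.full((4, 4), 2.0)
mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True
background = np.array([0.0, 2.0, 0.0, 2.0])
print(lens_snr(image, mask, background))  # 8.0
```

Under this definition, systems with SNR below the quoted threshold of 16 are the ones our detector struggles with most.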

DETECTION OF STRONG LENSING SYSTEMS FROM REAL OBSERVATION DATA
We evaluate the effectiveness of our pipeline using real observation data. Our testing involves processing images from the DESI Legacy Imaging Surveys and media images from the Euclid Early Release Observations with our pipeline. The images from the DESI Legacy Imaging Surveys contain numerous known strong lensing systems, providing an opportunity to assess the performance of our method. Despite the lower gray scale levels of the media images from Euclid, a space telescope with a similar spatial resolution, we can still make a preliminary estimate of the scientific capability of the CSST in detecting strong lensing systems by using the Euclid images as a reference.

Detection of Strong Lensing Systems from DESI Legacy Imaging Surveys
We use the following steps to test the performance of our pipeline with images from the DESI Legacy Imaging Surveys: • We have collected images from the DESI Legacy Imaging Surveys DR9 (Huang et al. 2020) that contain 5060 known strong lensing systems. Due to hardware limitations, we resize these larger images to dimensions of 1000 × 1000 pixels, resulting in a dataset of 12667 images. We again split the data into training and validation sets in a 7:3 ratio.
• Given the random distribution of strong lensing systems in real data, our model should identify such systems regardless of their placement: no matter where a strong lensing system is situated within an image, the model should detect it. Consequently, when generating training data, we deliberately shuffle the positions of strong lensing systems to simulate different scenarios. With the above steps, we have obtained the training data. Panels (a), (b), and (c) of Figure 13 depict the confusion matrices of our detection algorithm at confidence score thresholds of 0, 0.5, and 0.9 on the simulated data. As demonstrated in these figures, our algorithm performs well in detecting strong lensing systems, especially considering the significant influence of bright central galaxies and low signal-to-noise ratios on many of these systems. It is also important to highlight that a confidence score threshold of 0.5 achieves a satisfactory balance of precision and recall for further analysis.
• We begin by choosing the detection neural network trained on CSST simulation data as the initial weights. We then employ the training data described above to fine-tune this network. With the learning rate set to 0.001, we train the neural network for 100 epochs.
• After training, the neural network is capable of identifying strong lensing systems in images from the DESI Legacy Imaging Surveys. To assess its performance, we utilize the entire set of 5060 images to evaluate the effectiveness of our pipeline, aiming to ascertain its ability both to detect recognized strong lensing systems and to discover previously unknown ones.

Detection Pipeline Test with Known Strong Lensing Systems
For the detection of known strong lensing systems, we show the confusion matrices at confidence score thresholds of 0, 0.5, and 0.9 in Figure 13. When the threshold is set to 0, the recall rate of the model is 84.3%, indicating that the model captures a large proportion of true strong lensing systems. At this threshold, the precision is 97.54%, meaning that 97.54% of the samples detected by the model are indeed true strong lensing systems, with only about 2.5% being false positives. As the threshold increases, the number of false positives gradually decreases, demonstrating that the precision of the model improves significantly at higher thresholds. In particular, at a threshold of 0.5, there is only one false positive and the precision reaches 99.97%. At a threshold of 0.9, there are no false positives, resulting in a precision of 100%. This further confirms that our pipeline performs well on real observation data.
It is worth noting that the low false positive rate implies that, when the model is applied to large-scale sky survey data to detect galaxy-scale lenses, there will not be a significant number of falsely reported strong lensing cases. This is a crucial property: reducing false alarms saves valuable time and resources while enhancing the overall credibility of the study. These results provide crucial metrics for evaluating the performance of the model on real data. It is evident that by adjusting the threshold we can control the precision and recall of the detection results: depending on specific requirements, we can choose a higher threshold (such as 0.9) to reduce false positives or a lower threshold to improve the recall rate.

Detection Pipeline Test For Discovery of New Strong Lensing Systems
While testing the performance of our algorithm on data from the DESI Legacy Imaging Survey DR9, we have identified several new strong lensing candidates. With a detection threshold set at 0.5, we apply the pipeline to process the images illustrated in Figure 14. The dataset comprises 5060 images, each with dimensions of 1000 × 1000 pixels, covering an area of 241.207 deg². It is important to highlight that only a small subset of these images is utilized in this paper; further elaboration on this issue will be provided in our future papers. Following the procedures outlined in Section 3 and Section 4, these images are processed. The PSF-Net was trained using CSST images blurred by the Moffat model with variable FWHM values ranging from 2.0 to 8.0 pixels. After preprocessing and gray scale transformation, the neural network discussed in Section 6.1.1 is utilized for further image processing. Processing all these images takes 72 hours and yields additional candidates beyond the known 5060 instances of strong gravitational lensing. In Appendix A, we provide the celestial coordinates of some of these newly identified strong lensing candidates. The majority of them exhibit arc-like structures or multiple images, emphasizing the need for further follow-up observations. To facilitate such endeavors, we include a complete table of strong lensing systems in this paper, providing a comprehensive resource for undertaking subsequent observations.

Detection of Strong Lensing Systems from Media Images from Euclid Early Data Release
Euclid, a spaceborne telescope developed and operated by the European Space Agency, has a 1.2-meter aperture and a spatial resolution of 0.2 arcseconds. Given that Euclid observes from the visible (550 nm) to the near-infrared (900 nm) and shares a spatial resolution similar to that of the CSST, its images can effectively assess the performance of our pipeline. To evaluate our algorithm, we utilize the media images released by Euclid on November 7, 2023. For testing purposes, we exclude the Horsehead Nebula because of its complex background, which could introduce additional errors into the detection algorithm. As these media images have already been processed by pipelines developed by Professor Jean-Charles Cuillandre, we directly apply our algorithm to detect strong lensing systems without additional data preprocessing. However, given that the media images are TIFF files with relatively low gray scale levels, composed of data from both the visible and near-infrared bands, the algorithm may make errors, since it was trained on simulated CSST data covering the near-ultraviolet to visible bands. We extract stamp images with dimensions of 200 × 200 pixels from the original images and set the confidence threshold for detection to 0.5. To ensure complete detection of strong lensing systems, we also check the results by eye.
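The stamp extraction step can be sketched as follows (a minimal non-overlapping tiling; an overlapping stride could also be used so that systems at tile borders are not split):

```python
import numpy as np

def extract_stamps(image, stamp=200, step=200):
    """Cut an image into stamp x stamp tiles and record the pixel
    offset of each tile, so detections can be mapped back to the
    coordinates of the full image."""
    stamps, offsets = [], []
    H, W = image.shape[:2]
    for y in range(0, H - stamp + 1, step):
        for x in range(0, W - stamp + 1, step):
            stamps.append(image[y:y + stamp, x:x + stamp])
            offsets.append((x, y))
    return stamps, offsets

img = np.zeros((1000, 1000))
stamps, offsets = extract_stamps(img)
print(len(stamps))  # 25
```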
In total, our algorithm identified 67 strong lensing system candidates, which we categorize into three groups: high-quality strong lensing systems, moderate-quality strong lensing systems, and false positives. Appendix B shows 15 high-quality candidates, confirming the effectiveness of our method in detecting strong lensing systems. Additionally, there are 22 moderate-quality candidates, indicating that the completeness and purity of our results can be adjusted by varying the confidence level. However, 30 of the detections are false positives, mostly attributable to spiral galaxies and to diffraction spikes generated by the spider support structure. Given that the CSST employs a three-mirror anastigmat optical system, which does not introduce diffraction spikes, we anticipate that this class of false positives will not pose a serious problem. The lack of spiral galaxies with fine details in our training set, however, remains a challenge for our algorithm; to address this, we plan to perform high-fidelity simulations to obtain data for further training of the neural network.

CONCLUSIONS
Traditional methods of searching for strong lensing systems are time-consuming and labor-intensive. Looking ahead to upcoming sky survey projects such as the CSST, the development of highly efficient detection algorithms becomes paramount. In response to these urgent needs, this study has developed a comprehensive pipeline. The process begins with the generation of mock observation images based on a catalog of strong lensing systems and other celestial objects, which serves as prior information. A neural network based on the Swin Transformer architecture is then trained using either simulated images or real observation images containing verified strong lensing systems. Through training, the neural network becomes adept at identifying strong lensing systems within observation images. To enhance the detection ability of the algorithm, we introduce additional image preprocessing steps, including gray scale transformation and image restoration, applied to the original astronomical image data.
We rigorously assess the performance of the model using a comprehensive set of evaluation metrics. This assessment encompasses simulated data and real observation data, both with and without the gray scale transformation. When the noise in the data has a significant impact, we introduce an image restoration algorithm, combined with the gray scale transformation, to enhance the completeness of the detection. When applied to simulated CSST observation data, our method demonstrates a detection accuracy of 98.6% and a recall rate of 99.79%. Given the remarkable observational capabilities of the CSST, we are optimistic about its potential to identify undiscovered strong lensing systems. Moreover, we extend the application of our pipeline to images from the Euclid observations, where it successfully identifies 37 strong lensing system candidates.
Moving forward, our strategic roadmap involves refining the data composition to align the proportion of strong lens data more closely with real situations; this adjustment is anticipated to improve the precision of the model's detection capabilities. In response to false detections stemming from small lens/source structures and bright stars, we will enhance the data simulation algorithms by incorporating HST observations and hydrodynamic simulations to generate training sets that adhere more rigorously to real observations, which is particularly important for the reliability of applying our framework to space-borne imaging surveys such as the CSST, Euclid, and Roman. Furthermore, the images we have used from the DESI Legacy Survey DR9 are limited, and only several images from the Euclid telescope are used. In the future, we plan to search for strong lensing systems in the DESI Legacy Survey DR10, with the goal of identifying an increased number of strong gravitational lensing candidates. Expanding our strategy, we plan to undertake joint training utilizing real observation data obtained from different telescopes, seeking to increase both accuracy and recall. This comprehensive plan not only promises more scientific outcomes for ongoing initiatives such as the CSST main survey and Euclid, but also holds the potential to significantly advance our capacity to discover other celestial objects, from nebulae and galaxies to strong lensing systems. On 7 November 2023, the European Space Agency released five public images. We have chosen four of them to evaluate the performance of our algorithm, excluding the images of the Horsehead Nebula because of their complex background. Using the deep neural network with weights trained on CSST simulation data, we directly detect strong lensing systems in these images. We then visually inspect and classify the results into strong lensing systems with
high quality, those with moderate qualities, and false positives.All strong lensing systems with high quality and some with moderate quality are shown in Fig-

Figure 1: The flowchart of the image restoration step, which contains an image restoration neural network (RESTORE) and an image blurring neural network (PSF).
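The two networks in the flowchart form a cycle: RESTORE removes the degradation and PSF re-applies it, so re-degrading the restored image should reproduce the observation. The following is a minimal sketch of how such a cycle-consistency objective could be composed; `restore_net` and `psf_net` are placeholders for the trained networks, and the mean-squared-error form of the loss is an illustrative assumption rather than the paper's exact objective.

```python
import numpy as np

def restoration_cycle_loss(observed, restore_net, psf_net):
    """Cycle-consistency objective for the restoration step (sketch).

    restore_net: maps a degraded observation to a restored image.
    psf_net: re-applies the estimated blur/noise model to an image.
    Both are placeholders for the trained RESTORE and PSF networks.
    """
    restored = restore_net(observed)    # remove blur and noise
    redegraded = psf_net(restored)      # re-apply the degradation model
    # If PSF correctly models the degradation, the re-degraded image
    # should match the original observation pixel by pixel.
    return float(np.mean((redegraded - observed) ** 2))
```

With identity stand-ins for both networks the loss is zero by construction, which is a convenient sanity check when wiring up real models.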

Figure 2: We chose the r, g, and i bands to generate a color image. Figure (a) shows the raw data, Figure (b) shows the data after the asinh transformation, Figure (c) displays the restoration results, and Figure (d) shows the restored image after the asinh grayscale transformation. As depicted in these figures, incorporating the image restoration and grayscale transformation steps significantly enhances the visibility of image details.
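The asinh stretch brightens faint structure while compressing the dynamic range of bright pixels, which is why the lensed arcs become more visible after the transformation. A minimal sketch follows; the softening parameter `beta` and the min-max normalization are illustrative choices, not the values used in the paper.

```python
import numpy as np

def asinh_stretch(img, beta=5.0):
    """Asinh grayscale transformation (sketch).

    Maps pixel values to [0, 1], boosting faint features while
    compressing the bright end. beta controls the stretch strength
    (illustrative default, not the paper's value).
    """
    img = np.asarray(img, dtype=np.float64)
    lo, hi = img.min(), img.max()
    scaled = (img - lo) / (hi - lo + 1e-12)   # normalize to [0, 1]
    return np.arcsinh(beta * scaled) / np.arcsinh(beta)
```

Because arcsinh is linear near zero and logarithmic for large arguments, faint pixels are lifted toward the middle of the grayscale range while saturated galaxy cores are compressed.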
4. THE STRONG LENSING SYSTEM DETECTION ALGORITHM BASED ON THE SWIN TRANSFORMER

4.1. The Structure of the Strong Lensing System Detection Algorithm

Figure 3: As depicted in this figure, our method initiates the feature extraction process using the Swin Transformer. The extracted features are subsequently compiled into a feature map, which is then utilized for box regression and classification.

Figure 4: The specific structure and data flow of the Swin Transformer. The input image undergoes linear embedding, patch merging, layer normalization, global pooling, and fully connected layers to obtain the final output.
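Among these stages, patch merging is what gives the Swin Transformer its hierarchical feature maps: each 2 × 2 group of neighboring patches is concatenated along the channel axis (C → 4C) and projected down to 2C, halving the spatial resolution at each stage. A minimal sketch follows; the random projection matrix stands in for the learned linear layer.

```python
import numpy as np

def patch_merging(x, rng=None):
    """Sketch of Swin's patch-merging step.

    x: feature map of shape (H, W, C) with even H and W.
    Returns a map of shape (H/2, W/2, 2C). The random projection
    is a stand-in for the learned linear layer.
    """
    H, W, C = x.shape
    x0 = x[0::2, 0::2, :]   # top-left patch of each 2x2 group
    x1 = x[1::2, 0::2, :]   # bottom-left
    x2 = x[0::2, 1::2, :]   # top-right
    x3 = x[1::2, 1::2, :]   # bottom-right
    merged = np.concatenate([x0, x1, x2, x3], axis=-1)  # (H/2, W/2, 4C)
    rng = np.random.default_rng(0) if rng is None else rng
    proj = rng.standard_normal((4 * C, 2 * C))          # stand-in weights
    return merged @ proj                                # (H/2, W/2, 2C)
```

Stacking this step between Transformer blocks produces the pyramid of feature maps that the detection head consumes.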

Figure 5: The diagram illustrates the computation of self-attention with shifted windows. In layer l, the image is evenly partitioned into windows, and self-attention is calculated independently within each window. In layer l+1, the windows are shifted systematically, generating new windows that maintain connectivity between the windows of the previous layer when computing attention.

Figure 6: The figure illustrates the efficient batch processing approach used for self-attention with shifted window partitioning.
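The efficient batching works by cyclically rolling the feature map by half a window before partitioning, so the shifted configuration produces the same number of equal-sized windows and reuses the same batched attention computation. A minimal sketch of the two steps (a real implementation would also apply an attention mask across the wrap-around seams):

```python
import numpy as np

def window_partition(x, ws):
    """Split an (H, W, C) map into non-overlapping ws x ws windows."""
    H, W, C = x.shape
    x = x.reshape(H // ws, ws, W // ws, ws, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, ws, ws, C)

def shifted_windows(x, ws):
    """Cyclically roll by ws//2 before partitioning, so the 'shifted'
    configuration reuses the same batched window computation."""
    shifted = np.roll(x, shift=(-(ws // 2), -(ws // 2)), axis=(0, 1))
    return window_partition(shifted, ws)
```

After attention is computed inside each window, the inverse roll restores the original spatial layout.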

Figure 7: This figure illustrates the learning curve observed throughout the entire training and validation process. The model achieves convergence at approximately 80 epochs.

Figure 8: Figure (a) presents the PR curve of detection results from clear, unprocessed images, while Figure (b) displays the detection results from noisy, unprocessed images. As shown in these figures, noise and blur strongly degrade the detection results, so algorithms are needed to reduce these effects. We can also see that a smaller IOU threshold yields a larger AP, meaning higher recall and precision can be obtained at the cost of lower position-regression accuracy.
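The IOU threshold referenced here decides when a predicted box counts as a true positive: a detection matches a ground-truth lens only if their intersection-over-union meets the threshold, which is why relaxing it raises the AP. A minimal sketch of the standard IOU computation for axis-aligned boxes:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

For example, two unit-overlap boxes of area 4 each give an IOU of 1/7, which would count as a match at threshold 0.0 but not at 0.5.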

Figure 9: In Figure (a), the detection results of the clear image after grayscale transformation are depicted. Figure (b) illustrates the PR (precision-recall) curve of the noisy image after grayscale transformation, and Figure (c) displays the PR curve of noisy images after grayscale transformation and image restoration. Compared with the PR curves shown in Figure 8, the image preprocessing method enhances the detection capabilities of our algorithm and mitigates the effects caused by noise and blur.

Figure 10: The utilization of a fixed-size bounding box may affect detection outcomes, especially when dealing with large strong lensing systems. As illustrated in this figure, the blue box denotes a specific strong lensing object, but the bounding box only partially encloses it. As a result, the IOU score becomes excessively low, hindering the accurate detection of this particular strong lensing event.

Figure 11: The strong lensing system highlighted within the blue box in Figure (a) exhibits dimensions of 16 × 17 pixels and a signal-to-noise ratio (SNR) of 15.88. However, due to its small size, our algorithm was unable to detect it. The strong lensing system indicated by the blue box in Figure (b) has dimensions of 25 × 25 pixels and an SNR of 10.42. Despite its larger size, its relatively low SNR also prevented our algorithm from detecting it.

Figure 13: Figures (a), (b), and (c) depict the confusion matrices of our detection algorithm using confidence score thresholds of 0, 0.5, and 0.9 on the simulated data. As demonstrated in these figures, our algorithm performs well in detecting strong lensing systems, especially considering the significant influence of bright central galaxies and low signal-to-noise ratios on many of these systems. Additionally, it is important to highlight that a confidence score threshold of 0.5 achieves a satisfactory balance between precision and recall for further analysis.
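The trade-off behind these three confusion matrices is that raising the confidence threshold discards low-score detections, trading recall for precision. A minimal sketch of how precision and recall vary with the threshold, using toy scores rather than the paper's actual detections:

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall of detections whose confidence score meets
    the threshold. labels: True for real strong lenses, False otherwise."""
    tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l)
    fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and not l)
    fn = sum(1 for s, l in zip(scores, labels) if s < threshold and l)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Sweeping the threshold over all observed scores and recording the resulting (precision, recall) pairs is exactly how the PR curves in Figures 8 and 9 are traced out.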

Figure 14: The images selected for detecting strong lensing systems follow the distribution outlined in Huang et al. (2020). It is important to highlight that only a small subset of these images is utilized in this paper. Further elaboration on this issue will be provided in our future papers.

Figure 15: The figure demonstrates several detection candidates obtained from real observation data by our method. It can be observed that our model accurately detects and identifies strong lensing systems at arbitrary positions in the image.

Figure 16: The figure shows the results of our method in identifying strong lensing systems in Euclid media images. We have individually examined and categorized all these detections based on their morphology, classifying them into strong lensing systems of varying quality. The top three rows exhibit high-quality strong lensing systems, the fourth row displays strong lensing systems of moderate quality, and the fifth row reveals some false positive detections. The coordinates of the identified targets in the different Euclid media images are highlighted in red.

Table 1: Performance of the model on various datasets at an IOU threshold of 0.0. In this table, RT3 denotes clear data, RT4 represents noisy data, RT3+asinh and RT4+asinh indicate data subjected to the asinh grayscale transformation, and RT4+restore+asinh denotes noisy images that underwent both image restoration and the asinh transformation.

Table 2: Performance of the model on various datasets at an IOU threshold of 0.5. In this table, RT3 denotes clear data, RT4 represents noisy data, RT3+asinh and RT4+asinh indicate data subjected to the asinh grayscale transformation, and RT4+restore+asinh denotes noisy images that underwent both image restoration and the asinh transformation.

Table 3: Strong lensing system candidates obtained from images taken by the DESI Legacy Survey DR9. The table lists the RA, Dec, and detection confidence returned by our algorithm.