An Automatic Approach for Grouping Sunspots and Calculating Relative Sunspot Number on SDO/HMI Continuum Images

The relative sunspot number is one of the major parameters for the study of long-term solar activity. The automatic calculation of the relative sunspot number is more stable and accurate as compared to manual methods. In this paper, we propose an algorithm that can detect sunspots, and divide them into groups to automatically calculate the relative sunspot number. Mathematical morphology was adopted to detect sunspots then group them. The data set used were the continuum images from SDO/HMI. The process was carried out on the overall HMI data available on the timespan from 2022 January to 2023 May with a time cadence of one day. The experimental results indicated that the method achieved high accuracy of 85.3%. It was well fitted with the international relative sunspot number provided by Solar Influences Data Analysis Center (CC = 0.91). We calculated the conversion factor K value of SDO/HMI for calculating the relative sunspots number (K = 1.03).


INTRODUCTION
The Relative Sunspot Number(SN) represents the strength of solar activity and provides the longest record of solar activity (Clette et al. 2014(Clette et al. , 2015)).Coronal mass ejections and strong explosions on the solar surface can be observed more frequently when the sunspot number reaches its maximum.Continuous observation of sunspots has made it possible to monitor solar activity and thus predict the space environment (Hathaway & Wilson 2004;Lukianova & Mursula 2011;Yan et al. 2011).SN has long also been used in studies of solar dynamo, predictions of Solar cycle, and studies of secular variation of the Earth's climate (Solanki et al. 2004;Wang 2004).
The SN was first recorded according to the equation proposed by Wolf at the Zurich Observatory in 1849, and later taken over by the Solar Influences Data Analysis Center (SIDC) (Vanlommel et al. 2004;Clette et al. 2007).The SN is defined by multiplying the number of observed sunspot groups by 10, adding the total number of observed sunspots to the product, and then multiplying by the parameter K (Hossfield 2002;Clette et al. 2016).K is a conversion factor, which stipulates that the K of Zurich Observatory is 1.The values of K for the other observatories are determined by comparing their observations with the Zurich Observatory's Sunspot Number.The K value is related to factors such as weather conditions (transparency and Astronomical seeing) at the observatory, the aperture of the telescope and the experience of the observer.Different observation stations or instruments have different K values.
Conventionally, the calculation of SN is manual based and due to human subjectivity, the results are not stable.In recent years, with the development of image processing technology and deep learning technology, a variety of automatic sunspots extraction technologies have been proposed (Curto et al. 2008;Zhao et al. 2016;Carvalho et al. 2020;Hanaoka 2022).The statistical accuracy of the total number of sunspots has gradually improved.However, there are few automatic methods for sunspots grouping, the accuracy is also very low.For instance, Dasgupta et al. (2011) used clustering algorithm to group sunspots, and compared their results with Wolf Number.The difference error between the two was about 10%.Palladino et al. (2022) used the deep learning technology to group sunspots, and the accuracy of sunspot group recognition was 44.22%.
As the new solar observation instruments have collected massive sunspot data, a stable and accurate approach to accurate the relative SN is needed for these data.In this paper, we proposed an algorithm that could detect and group sunspots so that the Relative Sunspot Number could be calculated automatically.The dataset used in the study were SDO/HMI Continuum images.We further compared the calculated SN with manual means, Catalogue of Heliophysics Features (HFC) and SIDC respectively.In addition, the K value of the HMI was also calculated.To our knowledge, it is the first time that the K value for the HMI has been calibrated.The paper is organized as follows: Section 2 focuses on data.The proposed model is given in Section 3. Section 4 presents the results and discussion.Section 5 provides the method to calculate the K value.Finally, the summary is provided in Section 6.

DATA
We used the continuum images from SDO/HMI as the dataset.SDO/HMI is a space device, conducts continuous observation of the day 24 hours without interruption.It started the regular observation on April 30, 2010, providing high-quality sunspot full-disk data.It publishes a set of data every 15 minutes, including magnetic field data and corresponding continuous images.There are four different scale size of images, including: 256, 512, 1024, and 4096.We selected the daily images at 08:45:00UT from January 2022 to May 2023, with the image size of 1024 and an interval of one day.There were 507 images included in total.The original images were RGB, and converted to Gray by the standard Matlab program rgb2gray().

METHODS
As SN is calculated by the number of sunspots and sunspot groups, the method proposed in this paper consists of two pipelines: the first part is an automatic sunspot recognition method for automatically detecting sunspots and calculating the total number of sunspots on the solar disk.The second part is an automatic grouping method of sunspots, for automatically calculating the number of sunspot groups.
Despite the rapid development of deep learning and computer vision technology used in image processing, here we still chose the traditional image morphology processing techniques.It is mainly based on the following considerations: (1) For images with uniform background, the target and background grayscale are significantly different, the traditional Mathematical Morphology(MM) generally has good performance.As sunspots are obviously different from the solar disk, MM should be suitable for detecting sunspots.(2) Deep learning technology requires a large amount data annotation and high computational performance, which requires a lot of manpower and computational power.While traditional image processing technologies do not impose these requirements.

Mathematical Morphology Algorithm
Mathematical morphology is a discipline of image analysis based on lattice theory and topology, which studies spatial structure and morphology (Serra 1982;Heijmans 1995).The basic concept of it is to change the shape and features of an image by performing specific operations with the structural elements on the images.The commonly used structuring elements are crosses, squares, and open disks.
The basic morphological operators are erosion and dilation.If we set a binary image as J : u → 0, 1, and the foreground pixels are u f = J −1 ({1}).The erosion and dilation of the binary image J by the structuring element S ∈ Z × Z are defined as: The other two major operations in morphology are opening and closing.Opening operation is considered to be erosion, followed by dilation and this operation eliminates small objects and sharpens peaks in the object.Closing operation is first dilation and then erosion, which fuses narrow breaks and fills small holes and gaps in the image.
Dilation is an operation that looks for local maximum and makes the highlighted area larger.Erosion is the operation that looks for the local minimum and makes the highlighted area smaller.For sunspot images, the highlighted part is the solar disk, and the black parts are the sunspots.As the dilation and erosion operations are two dual operations, the dilation and erosion operations can be applied to the black area as well.Dilation makes sunspots smaller, and erosion makes sunspots larger.

Automatic recognition of sunspots
We described the automatic sunspot identification method in detail in Zhao et al. (2016), and here we briefly describe the processing flow with the SDO/HMI continuum image of January 14, 2022 as an example: (1) For the observation image shown in Figure .1(a), the morphological closing operation was applied to eliminate sunspots and other darker areas on the solar surface, obtain a clean solar disk, as shown in figure 1(b).The shape of the structuring element used in closing is disk, with a size of 10.
(2) Figure 1 (4) To eliminate the noises of the solar disk that can be easily regarded as sunspots (which may be instrumentgenerated noise or darker areas on the solar disk), we set the area of each candidate region to be greater than a fixed value of 5.In addition, since the gray value of the noise is relatively uniform, and the gray value of the sunspots center are lower than that at the edge, we set that the difference between the maximum and minimum values of pixels in the set candidate area needs to be larger than a fixed value of 5.With the above rules, the noises on the solar surface were removed and the pure sunspot regions were obtained, as presented in Figure 1(e).The sunspots are marked in red.

Group sunspots and calculate the relative SN automatically
Since sunspots belonging to a group are generally clustered together, we assume that the sunspots smaller than a certain distance should be classified as a group.Therefore we designed the algorithm flow shown in Figure 2: (1) The white regions in Figure 2 (2) Each white region in Figure 2(b) was labelled and numbered with a red rectangle, and then superimposed on the original Figure 1(a) to obtain the sunspots grouping Figure 2(c).This step was made automatically by Matlab programs.It was divided into three steps: Firstly, the program regionprops() was used to get each white area bounding box.It was the smallest rectangle containing the white area, which was extracted by the program automatically.Secondly, the program rectangle() was applied to show it out.Meanwhile the program text() was used to label it.At last, the bounding boxes were superimposed to the original image.
(3) To show the grouping effect clearly, we enlarged the sunspot groups numbered 3 and 9, as shown in Figure 2(d) and 2(e).It could be seen that the smaller and darker sunspot groups and sunspot groups at the edge of the solar disk have been identified, and the algorithm was more effective in identifying sunspot groups.
As shown in the figure below, the total number of sunspots was 15 and the number of sunspot groups was 9, so the relative number of sunspots was 105.

Compare our results with HFC's
Catalog of Heliophytics Features (HFC) (Bonnin et al. (2013)) is the Solar Survey Archive of BASS2000 available at https://bass2000.obspm.fr/home.php.It provides daily sunspot groups numbers, which were adopted here as a comparison with our method.
Since the HFC can only select a particular date to search for the number of sunspot groups on that day, it can not select a time span to query, which is more laborious.Therefore only a small period of data could be used in the study, and this was not statistically significant (we will compare statistically with long-term data from SIDC later in Section 5).We searched for the daily sunspot groups numbers from April 1 to April 30, 2023, and calculated it in the same period by our algorithm.The results are shown in Table 1.
As can be indicated in Table 1, our results generally follow the same trend as those of HFC.However, the numbers of sunspot groups calculated by our algorithm were mostly larger than that of HFC.This was because our algorithm   could detect weak sunspot groups that have just appeared or were about to disappear.The sunspot groups in the left image in Figure 3(a) were detected by our algorithm.The sunspot groups in the right image were marked by HFC.By comparison, the group numbered 3 was located at the edge of the solar disk, and it was found by our algorithm but ignored by HFC.In a few case, our algorithm mistook a large discrete sunspot group for two small groups.For example in Figure 3(b), the sunspot group numbered 2 and 3 should actually belong to a group.It was because our algorithm divided sunspots into a group according to the distance between sunspots.If the distance between sunspots in a large group was bigger than a fixed value, it would be divided into two groups.

Accuracy and Efficiency
To test the accuracy of our model, we set up a benchmark for comparison using the manual labelling results.As shown in Table 2, we manually annotated the data from April 1, 2023 to April 30, 2023 and then calculated four parameters: True Positive (TP), False Negative (FN), False Positive (FP), and True Negative (TN).They are the concepts from the Confusion matrix (Visa et al. (2011)).TP means that the sunspot groups on the solar disk are correctly identified and predicted to be positive.FN indicates that the originally existing sunspot group on the solar disk is not recognized, and is incorrectly predicted as the background area of the solar disk.FP indicates that the solar background or noisy regions are mistakenly identified as a sunspot group.TN represents the correct recognition of the solar background or noise regions as background, which is meaningless here.
To evaluate the model, we used machine learning classification algorithm metrics such as Accuracy, Precision, Recall, and F1 Score, as shown in formulas (1) -(4) (Goutte & Gaussier 2005;Yacouby & Axman 2020).Accuracy refers to the "number of correctly predicted samples/total number of samples".Precision refers to how many of the samples we predict to be true are indeed true.Recall refers to how many samples that are actually true have been selected by the model.Precision and Recall are often mutually exclusive.The F1 score takes into account both Precision and Recall factors, achieving a compromise between them.The F1 score can be directly used to evaluate the performance of a model.

Accuracy =
T P + T N T P + T N + F P + F N (3) After substituting the data in Table 2 into the equation, the results can be seen in Table 3.As indicated in the table, the performance of our model is significantly better than that of Palladino et al. (2022).Palladino et al. (2022) proposed a method where a pipline of state-of-the-art sunspots groups detection and classification.Deep neural networks (DNNs) was adopted in their study to detect and classify sunspots groups in real time.They used the McIntosh system for sunspot groups classification and the generation process was carried out on the overall HMI data available on the timespan that went from 2010 to 2021 with a sampling time of 1 day.In order to classify sunspots groups, they paid more attention to the area sunspots groups in detection stage.Therefore, the accuracy metrics they choose is Intersection over Union (IoU) between predicted bounding box and ground truth bounding box.Intersection over union was then computed as the ratio of the overlapping area of the two bounding boxes over the union of their total area).Our choice of accuracy metrics was dependent on the number of sunspots groups, which was more reasonable as our purpose was to automatically calculate the relative SN.
As can be seen from the table, our algorithm has high recall rate of 99.3%, indicating that almost all sunspot groups present on the solar disk can be accurately detected.The precision rate of 85.8% is lower than the recall rate, which could explained by the fact that faint sunspot groups that were difficult to observe manually were now detectable by our algorithm.
In terms of computation time, our algorithm processed over 500 pieces of data on a single core regular desktop computer, in 5 minutes, with an average processing time of 0.6 seconds per data.In contrast, the algorithm proposed by Palladino et al. (2022) initially took nearly 8 hours of computation time to generate nearly 3000 images on a regular desktop computer.After parallelizing on a 6-core CPU, the computational time was reduced to 1 hour.Our algorithm is clearly faster and has better real-time processing performance.

CALCULATE THE K VALUE OF HMI RELATIVE SUNSPOT NUMBER
By writing a Python data collection program, we downloaded data between January 2022 to March 2023 and selected a daily sample of 507 images from http://jsoc.stanford.edu/data/hmi/images/.Then our algorithm was conducted to calculate the relative SN.On the other hand, with SIDC as calibration, we downloaded the SN in the same time span from https://www.sidc.be/SILSO/home.Then the correlation between them was calculated and shown in Figure 4.The horizontal axis represents the date (year-month), and the vertical axis denotes the Relative SN.The correlation coefficient between the two sets of data is 0.91.It is highly consistent, indicating that our algorithm is reliable.
As shown in Figure 5, we performed the least squares fit to both sets of data, which resulted in a fit of K=1.03.This is the scaling value of the HMI instrument.

SUMMARY
The long-term relative SN is statistically helpful in exploring the periodic patterns of the sun.In this paper, an automatic algorithm for calculating the relative SN is proposed.The experiments were conducted on HMI dataset.The conclusions of this article are as follows: (1) The performance of our algorithm: Accuracy=85.3%,Precision=85.8%,Recall=99.3%,and F1 Score=92.1%.(2) The average processing time of our algorithm is 0.6 seconds per data, which is remarkably fast and real-time.(3) The relative SN calculated by our algorithm is highly consistent with SIDC with a correlation of 0.91.This indicates our algorithm is reliable.(4) For the first time, we have calculated and confirmed the K value of SDO/HMI, which is K=1.03.
(a) was subtracted from Figure 1(b) yield the darker regions on the solar disk, as shown in Figure 1(c).(3)Figure1(c) is segmented automatically by the threshold to obtain the Figure1(d), which was a binary image with white areas as candidate sunspots.Considering the limb darkening, the threshold for the area within [1:1000,100:900] was set to 22, and the rest was set to 15.
(a) are the sunspots detected in the first stage.Morphological Erosion operation was carried out on Figure 2(a), and a suitable structural element of size 15 was selected experimentally.The white areas closer to each other merged into a large area, as shown in Figure 2(b).

Figure 1 .
Figure 1.An example of continuum image on January 14, 2022 to carry out automatic sunspots recognition: (a) the original image, some sunspots are seen on the solar disk.(b) the result of performing a closing operation on (a).(c) the result of subtracting (a) from (b).(d) a binary image, which is the result of the threshold automatic segmentation of (c).(e) sunspots marked with red color are detected on the disk.

Figure 2 .
Figure 2. Group sunspots after they are be detected: (a) the sunspots detected in the first stage.(b) the results of erosion calculation on figure (a).(c) sunspots grouping labeled with a total 9 groups and overlayed to the original image.(d) enlarge and display group 3. (e) enlarge the display group 4.

Figure 3 .
Figure3.Compare the sunspot grouping difference between our algorithm with HFC.On the left is the number of sunspot groups calculated by our algorithm on 2023-4-3.On the right is the number of sunspot groups marked by HFC on 2023-4-7.

Figure 4 .
Figure 4. Correlation between the relative SN of our algorithm and of SIDC

Table 1 .
Compare the number of sunspot groups calculated by Our Algorithm with HFC's

Table 2 .
Sunspot groups number by our model and by manual means

Table 3 .
Performance on Sunspot Group Detection.