A statistical primer on classical period-finding techniques in astronomy

The aim of our paper is to investigate the properties of the classical phase-dispersion minimization (PDM), analysis of variance (AOV), string-length (SL), and Lomb–Scargle (LS) power statistics from a statistician’s perspective. We confirm that when the data are perturbations of a constant function, i.e. under the null hypothesis of no period in the data, a scaled version of the PDM statistic follows a beta distribution, the AOV statistic follows an F distribution, and the LS power follows a chi-squared distribution with two degrees of freedom. However, the SL statistic does not have a closed-form distribution. We further verify these theoretical distributions through simulations and demonstrate that the extreme values of these statistics (over a range of trial periods), often used for period estimation and determination of the false alarm probability (FAP), follow different distributions than those derived for a single period. We emphasize that multiple-testing considerations are needed to correctly derive FAP bounds. Though, in fact, multiple-testing controls are built into the FAP bound for these extreme-value statistics, e.g. the FAP bound derived specifically for the maximum LS power statistic over a range of trial periods. Additionally, we find that all of these methods are robust to heteroscedastic noise aimed to mimic the degradation or miscalibration of an instrument over time. Finally, we examine the ability of these statistics to detect a non-constant periodic function via simulating data that mimics a well-detached binary system, and we find that the AOV statistic has the most power to detect the correct period, which agrees with what has been observed in practice.


Introduction
Periodic variable objects, i.e. objects whose properties change over regular intervals of time, are ubiquitous in the universe and are a source of continuous study for astronomers.For example, the periodic variation of pulsating stars allow astronomers to calculate the distance to these stars and use them as standard candles or for cosmology, like measuring the Hubble constant (Bahcall 2015, Riess et al. 1998).Alternatively, periodic changes in a star's brightness could be used to classify a star as being part of a binary system, e.g.Xu et al. (2022), or indicate the presence of a planetary companion, e.g.Rappaport et al. (2012), Lafarga et al. (2020).Outside of stars, the periodic rotation of Cold Classical Kuiper Belt Objects contain information about planet formation and the development of our solar system during the first 100 million years after the Sun's birth (Thirouin & Sheppard 2019).Thus, accurately characterizing a periodic signal in observed data is of vital importance to astronomers.
However, because the field of statistics diverged from astronomy in the late 19th century, statisticians had little involvement in the development of these methods.This lack of input from statisticians has led to concerns over the statistical properties of these classical astrostatisical methods for period detection.For instance, in reference to the PDM statistic, Feigelson et al. (2021) state: "None of these statistical issues [e.g.non-Gaussian, correlated error and multiple-testing] have been examined by mathematical statisticians despite the use of the PDM in ∼1,500 astronomical papers over four decades."Though, of course, these concerns are also valid for the other methods introduced above.Our paper begins to fill this gap by investigating and comparing the theoretical and empirical properties of these classical statistics and aims to open a dialogue within the astronomical community about the appropriate use of them.To the best of our knowledge, no such discussion and comparison has been published from a statistician's perspective, though some have been published by astronomers, e.g.Hara & Ford (2023), Heck et al. (1985), VanderPlas (2018).
The organization of the paper is as follows.Section 2 briefly reviews some examples of where the discussed techniques are commonly employed in astronomy.Section 3 discusses the statistical model of the raw data.Section 4 discusses the hypothesis test of detecting a periodic signal from data.Section 5 details the phase-based methods; specifically the AOV, PDM, and SL methods are discussed.Section 6 details the LS power.Section 7 empirically demonstrates through simulation studies the theoretical properties discussed in Sections 5 and 6, explores the robustness of these methods to challenges faced in real-world datasets, and briefly discusses the power of these statistics to detect periodic signals encountered by astronomers.Finally, Section 8 finishes with some concluding remarks.Code and source files to reproduce all figures are publicly available at https://ngierty.github.io/research.html.

Astronomical Context
In this section, we briefly demonstrate where the statistical methods for period detection discussed in this paper are commonly employed in astronomy.In particular, we highlight that these methods have been used for determining the periods of variable stars and other objects inside and outside of our solar system.
Variable stars are divided into two groups based on their type of variability: 1) extrinsic variability and 2) intrinsic variability.The extrinsic variability in a star's brightness is due to some external change like rotation of the star or eclipses.For example, a star's brightness changes as a result of a star's rotation because dark solar spots on the surface of the star appear and disappear from view (Griffiths 2018).Alternatively, in an eclipse, the light we see from a star is blocked by another object, often a companion planet or star, and appears as a dip in the light curve (Tohline 2002).One of the most common types of these variable stars are eclipsing binaries, where there are two stars orbiting each other.Multiple studies have determined the periods of such systems using the methods studied here in the Milky Way and beyond, e.g.Stassun et al. (2004), Graczyk et al. (2011), Prša et al. (2011) and Xu et al. (2022).Additionally, these methods have occasionally been used to detect the presence of extra-solar planets.For instance, Anglada-Escudé et al. (2016) discovered a terrestrial exoplanet candidate around Proxima Centauri using the LS power.
On the other hand, intrinsic variability is from variations in the physical properties of the star.These variables are classified into pulsating, eruptive, or cataclysmic.A pulsating variable star's radius contracts and expands periodically which changes the star's brightness (Eyer & Mowlavi 2008).RR Lyrae and Cepheids are examples of pulsating stars, and many studies used AOV, PDM, and LS to estimate their period, study the period's relation to the star's luminosity, and estimate the star's distance from Earth (Matsunaga et al. 2006, Nemec et al. 2013, Udalski et al. 1999, 2018, Torrealba et al. 2015, Gerke et al. 2011, Shappee & Stanek 2011).An eruptive variable star's surface goes through eruptions and accretions of matter, causing variations in brightness (Good 2003).Protostars, Orion variables, and gas giants are examples of objects that are frequently classified as eruptive variables.Finally, a cataclysmic variable star goes through a sudden change in the surface brightness of a star at the end of main-sequence life (Knigge et al. 2011).Novae and supernovae are examples of this type of variation (Baade & Zwicky 1934, Chomiuk et al. 2021).While less frequent, the statistical methods discussed in this paper have also been used as a first step to identifying stars in the eruptive and cataclysmic variable star sub-categories, e.g.Alfonso-Garzón et al. (2012).
Finally, the methods examined here have also been used to determine the periodic properties of non-stellar objects.For example, several studies studied the periodicity of objects within our solar system such as the quasi-periodic variation in solar wind speed and geomagnetic activity (Mursula & Zieger 2000), the periodicity of outburst activity in comets (Trigo-Rodríguez et al. 2010), and the rotation period of Kuiper Belt Objects (Lacerda et al. 2008).Additionally, these methods have been used to determine the periodicity of more exotic astrophysical systems such as Fast Radio Bursts (FRBs) (Rajwade et al. 2020), short-period variable quasars (Charisi et al. 2016), γ-ray bright Blazers (Bhatta & Dhital 2020, Bhatta 2021, Otero-Santos et al. 2023), methanol masers (Goedhart et al. 2014), and γ-ray periodicity in active galactic nuclei (AGN) (Peñil et al. 2020).
This brief summary shows the importance and diverse use of the statistical methods reviewed in this paper for studying the time series observations of many astrophysical objects and phenomena.

Statistical Model of Raw Data
Period detection methods can be used for any numerical observation that a researcher believes comes from a periodic signal (including a constant signal) that has been observed with some noise.For example, to determine whether a star of interest is a variable star, astronomers measure the star's flux or apparent magnitude, both standardized measurements of the amount of light Earth receives from a star, for several time points.This flux or magnitude is often assumed to come from a periodic signal but observed with additive noise.This section details how such an observed value can be represented mathematically in the time domain or the frequency domain, which are then used to construct the PDM and AOV statistics in Section 5 and the LS power in Section 6.
Specifically, the statistical model for a numerical observation, X t , at time t where h(t) is a periodic function oscillating around zero with true period p 0 (i.e.h(t) = h(t + kp 0 ), for any integer k), and ϵ t iid ∼ N (0, σ 2 ) for some σ > 0; in other words, we assume the random noise is a realization of an independent and identically distributed Gaussian random variable.This periodic function can be written in either the time domain, as presented in equation ( 1), or can be transformed into the frequency domain using the discrete Fourier transform defined as (2) for frequency f ∈ R, imaginary i, and equally spaced time points t.
There are two important aspects to note about the data generating model in equation ( 1).First, a constant function is considered a periodic function with p 0 = ∞ (Priestley 1981).While this is not of direct interest to astronomers, it is important in calculating the FAP, often referred to as the p-value in statistics literature.This issue is discussed more below in Section 4 and Section 5. Second, assuming ϵ t iid ∼ N (0, σ 2 ) is a fairly strong assumption which is not always met in practice (Feigelson et al. 2021).Specifically, the assumption of homoscedastic variance is often violated in astronomical data sets due to measurement error; fortunately, however, astronomers can typically calculate their measurement error.This information can be used to rescale the data to correct for heteroscedastic measurement errors.In particular, given heteroscedastic data generated according to model (1) with ϵ t indep ∼ N (0, σ 2 t ), the transformed data Xt = σ −1 t X t has homoscedastic variance.However, care must be taken as correcting for heteroscedasticity in this way assumes that all noise in the process has been accounted for in determining σ t .
Finally, if either the homoscedasticity or the Gaussianity assumption is violated, the calculated FAPs will misrepresent the actual false positive rate.This could lead the practitioner to a belief about a spurious result or to overlook a meaningful result.There are statistical tests that can and should be used to test the assumptions of Gaussianity, e.g.D'Agostino's K-squared test (D'Agostino 1970) and the Shapiro-Wilk test (Shapiro & Wilk 1965), and homoscedasticity, e.g the Breusch-Pagan test (Breusch & Pagan 1979) or the Goldfeld-Quandt test (Goldfeld & Quandt 1972).However, these tests often have additional assumptions and may not be appropriate for small sample sizes.Unless the practitioner has prior knowledge of the validity of these assumptions, they should avoid computing or interpreting analytical FAPs (those based on the parametric distributions discussed below) and instead use FAPs calculated from bootstrapping the data.Though, the practitioner should be warned that bootstrapping is not a panacea, even if the computational burden is low.First, the variability in the bootstrapped estimates is smaller than the variability that would be expected in the estimates obtained from new samples.This is because the population from which the bootstrapped samples are drawn is fixed as the observed sample.While this issue is not particularly problematic for medium and large sample sizes, practitioners should be wary of applying this method for smaller sample sizes; see Chernick (2011) for recommendations.Second, the bootstrap method may fail to provide accurate estimates under certain transformations of the data, e.g. the minimum or maximum of a set (Chernick 2011).However, there have been proposed remedies to the bootstrapping procedure to address such limitations and maintain desirable properties of the bootstrap under certain conditions (Chernick 2011, Fukuchi 1994).

Hypothesis Test for Periodicity
To perform a test for periodicity, one must test whether the data come from a nonconstant periodic function, i.e. test if p 0 < ∞, e.g.Dworetsky (1983), Schwarzenberg-Czerny (1989, 1997) and Stellingwerf (1978).This can be written as a formal hypothesis test as follows: The FAP is then calculated using the distribution of the test statistic (e.g.PDM, AOV, etc.) assuming H 0 is true.As will be shown in Sections 5 and 6, only the PDM, AOV, and LS power statistics have known closed-form distributions under H 0 .Thus, for the remainder of the paper, when we discuss a test for periodicity, as well as calculating an FAP, we are doing so under (3).Finally, since the true period is unknown, many trial periods are tested.The FAP associated with the period that gives the minimum PDM or SL statistic or the maximum AOV statistic or LS power does not follow the same calculation as the FAP for an arbitrary period.Further, the FAP associated with the period that gives the extreme statistic, e.g. the maximum AOV, will automatically control for considering a range of potential periods and calculating statistics at each of them.This is discussed in more detail below.Issues of multiple-testing occur, for instance, if one period detection method is used to narrow a large range of periods to a more manageable set of trial periods, and then another period detection method is used only on the smaller set of trial periods to determine which periods are significant; or only the smaller set of periods is used in constructing downstream models which undergo their own model selection procedure, e.g.Mowlavi et al. (2023) and Ritchie et al. (2022).given some observations that come from a periodic function.The left panel displays simulated observations generated as perturbations from a periodic function.The middle and right panels illustrate the result of phase-folding the observations according to an incorrect and the correct trial period, respectively.Clearly, if one were to bin the observations based on the phase-folded time then the bin variances would be smallest and the differences in bin means would be largest in the right-most figure.Similarly, a line connecting all observations in phase-based order would be shortest in the right panel.

Phase-Based Methods
Phase-based methods aim to determine the period of a periodic function based on noisy observations, and they leverage the idea that observations can be folded (or phase-folded) in time modulo the true period of the function.We will denote the phasefolded time as ϕ t = (t mod p)/p; thus, the phase-folded-time-observation pair will be denoted (ϕ t , X t ).Given a trial period, the observed data can always be phase-folded, but if the trial period is not close to the true period then the phase-folded values likely will not align with the function h in phase.For example, the left panel of Figure 1 shows a set of arbitrary observations, e.g.magnitude or flux, that has been centered and scaled with their corresponding (arbitrary) observation times where centering and scaling, here, refers to subtracting the sample mean from each observation and then dividing by the sample standard deviation.The middle and right panels then show the same data phase-folded for an incorrect and correct trial period, respectively.Clearly, the data in the right panel aligns with the underlying function.
In the classical literature on these methods, Lafler & Kinman (1965), Stellingwerf (1978), Schwarzenberg-Czerny (1989), Whittaker & Robinson (1926) suggest binning the phase-folded data into groups with similar phase-values and deriving statistics based on ratios of sums of squared deviations to test various trial periods.Specifically, the AOV statistic aims to measure the difference in the bin means and the PDM statistic aims to measure the differences between the observations within a bin.Alternatively, Burke Jr et al. (1970) suggested measuring the fit of the trial period p based on the Euclidean distance between neighboring time-observation pairs after phase-folding; this method is often referred to as the SL method.Examining Figure 1 again, clearly, the phasefolded observations in the right panel will have bin means that are the most different (i.e. the largest AOV), bin variances that are the smallest (i.e. the smallest PDM), and the smallest SL compared to other trial periods.We discuss these methods separately below.

Sum-of-Squares Methods
The basic methodologies for detecting non-sinusoidal periodic signals as developed in Stellingwerf (1978) and Schwarzenberg-Czerny (1989) rely on a simple sum-of-squares decomposition.They are intuitively elegant, and the theory behind these methods, namely Cochran's Theorem, is well-rooted in classical statistical results dating back to the early twentieth century (Monahan 2008).The purpose of this section is to provide a detailed account of the statistical theoretical foundation for these tests.
Namely, for a trial period p, the null and alternative hypotheses suggested by Stellingwerf (1978) are formulated as where µ 1 , . . ., µ r are defined as the bin means for the phase-folded data modulo p.The bin means being equal (as in H 0 ) suggests no periodicity in the observed values, and so statistical evidence supporting the assertion of a periodic signal in the data is established if a test rejects H 0 in favor of H a .Note, this is equivalent to the hypothesis test setup in equation ( 3) after the data has been phase-folded and binned.The classical PDM and AOV test statistics are derived next.‡ Let X ij be an observation, e.g.observed flux of a star, which has already been binned in-phase based on some trial period p, where i ∈ {1, . . ., r} is the bin index, r is some specified number of bins, j ∈ {1, . . .n i } is the observation index for the i-th bin, and n i is the number of observations in bin i.Let n := i n i be the total number of observed time instances, and take S 2 0 , S 2 1 , and S 2 2 to denote the total sample variation, the across-bins mean variation, and the within-bin variation, respectively.Then, (5) The PDM statistic proposed by Stellingwerf (1978) and the AOV statistic proposed by Schwarzenberg-Czerny (1989) are defined, respectively, as and the trial period p is tested by determining the FAP, which is the tail probability of the sampling distributions of these statistics under the null hypothesis H 0 .
Under H 0 and model (1), Stellingwerf (1978) incorrectly argued that the PDM statistic followed some parameterization of the F distribution.The inappropriateness of the F distribution for the PDM statistic is discussed in Heck et al. (1985) and referenced in Nemec & Nemec (1985) with inaccurate mathematical justification for why it does not follow an F distribution.These authors erroneously claim that the PDM statistic does not follow an F distribution because the numerator, S 2 2 , does not follow a chi-squared distribution.They are misled in remarking that the matrix associated with its quadratic form is not idempotent.Subsequently, however, Schwarzenberg-Czerny (1989) correctly concluded that the PDM statistic does not follow a F distribution since the numerator and denominator are not independent (as is clear from equation ( 5)).In a followup paper, Schwarzenberg-Czerny (1997) argues that a scaled version of the PDM statistic, specifically n−r n−1 T PDM , follows the beta distribution with shape parameters n−r 2 and r−1 2 , assuming H 0 is true.Additionally, Schwarzenberg-Czerny (1989) accurately states the AOV statistic follows the F distribution with degrees of freedom r − 1 and n − r under H 0 .
Arguing that T AOV ∼ F r−1,n−r is a direct application of Cochran's theorem which is stated below; the proof is omitted.The statement roughly follows the lecture notes of Feng (2015) for clarity and accessibility to the reader; alternative formulations and proofs can be found in statistical texts such as Monahan (2008).‡ Note, since the PDM statistic is a generalization of the Lafler and Kinman technique and a modification of the Whittaker and Robinson statistic (Stellingwerf 1978), we will not discuss these.
. ., k} where χ 2 r i denotes the chi-squared distribution with r i degrees of freedom.
Note, equation ( 5) can be expressed in terms of the quadratic forms where X := (X 11 , . . ., X 1n 1 , . . ., X r1 , . . ., X rnr ) ′ , I n is the n × n identity matrix, J n is an n × n matrix of ones, and P V := V (V ′ V ) −1 V ′ is the projection matrix onto the column space of V := diag{1 n 1 , . . ., 1 nr }.Note, V is simply a matrix whose columns correspond to a bin number and rows are indicators of whether an observation belongs to a particular bin; in other words, V i,j (the ith row and jth column of V ) is an indicator for whether observation i has been phase-folded into bin j.If p 0 < ∞, we take V as the matrix of indicators if the data are phase-folded modulo p 0 .More simply, V is the design matrix in the linear regression X = V µ + ϵ with µ = (µ 1 , . . ., µ r ) ′ and ϵ := (ϵ 11 , . . ., ϵ 1n 1 , . . ., ϵ r1 , . . ., ϵ rnr ) ′ .Finally, Cochran's theorem applies to establish that follows by a series of arithmetic steps to obtain Then, since T AOV ∼ F r−1,n−r , taking derivatives on both sides yields, for u ∈ [0, 1], ; the density function of the beta n−r 2 , r−1 2 .Observe that the σ 2 terms cancel when constructing T AOV and T PDM .A point of caution is that this cancellation will not occur when the errors are heteroscedastic, invalidating the appropriateness of relying on critical values from the F or beta distribution, respectively.In the heteroscedastic setting, the practitioner would need to know the standard deviations to scale each observation as discussed in Section 3.
Under H 0 , the PDM and AOV statistics will be from the beta n−r 2 , r−1 2 and F r−1,n−r densities, respectively, regardless of the trial period.However, if H a is true, the distributions of T AOV and T PDM depend on whether the trial period p is the true period p 0 or not.If p = p 0 , it is established from well-known statistical theory that Monahan (2008) and Chattamvelli (1995a) for details.In the case that p ̸ = p 0 , the flux values have been folded modulo the wrong period, and so the true data generation equation has the form X = W µ + ϵ for some design matrix W ̸ = V .Accordingly, T AOV and T PDM follow doubly non-central F and doubly non-central beta distributions, respectively (Chattamvelli 1995b).While the distributions that arise under H a could theoretically be used to determine the power of the tests, assumptions would need to be made on µ and σ on a case-by-case basis and are outside the scope of this paper.
Finally, it should be noted that the distributions discussed above are the distributions of the PDM and AOV statistics under a given trial period.In practice, many trial periods are tested and the period that minimizes the PDM or maximizes the AOV statistic is taken as the estimated period in the data.The FAP of the minimum PDM or maximum AOV statistic, however, does not follow these distributions and would need to be derived in a similar manner as the FAP for the LS power; this is left for future work.However, we reiterate here that the distributions of the minimum PDM and maximum AOV statistics, and their corresponding FAPs -once derivedwill automatically control for "testing" a range of trial periods, and multiple-testing is not a concern in that case.

String-Length Method
Another popular phase-based method is the SL method suggested by Burke Jr et al. (1970) which measures the fit of the trial period p based on the Euclidean distance between neighboring time-flux pairs after phase-folding.Specifically, the SL is defined as Note, since the desired function is periodic, the distance between the first and last points is also included (as the first term) with an adjustment for the difference in phase values.The period p that minimizes this distance is taken as the estimated period for the observed data values.Renson (1978) criticized this method because the observed flux or magnitude values have different units from the phase values.To mitigate this issue under the assumption of Keplerian variation, Dworetsky (1983) suggested scaling the observed values as Xt = (X t − X min )/2(X max − X min ) − 0.25.Under our H 0 in equation (3), we can achieve homogeneity and ensure the SL statistic is robust to changes in units by scaling the observed values as Xt = (X t − X min )/(X max − X min ), which bounds both ϕ and Xt to be between zero and one.
When the observations are deterministic and from a continuously differentiable function, then L can be determined using arc length equations from geometry.However, in the statistical setting, we must start with the distribution of the observed values.Specifically, under H 0 , it is easy to show using standard transformation of random variable techniques that Casella & Berger (2021) for these techniques.Unfortunately, however, the sum of these variables does not have a closed form distribution.Because of this difficulty, simulations or bootstrapping methods, with potential modifications as mentioned above, are needed to determine the FAP of the period that minimizes L.

Lomb-Scargle Power
The phase-based methods discussed in Section 5 are designed to work for time domain values of periodic functions, i.e. observations of the form (1).Alternatively, the LS power examines the data for a periodic component in the frequency domain using the Fourier transform, i.e. observations of the form (2). The square of the Fourier transform is referred to as the power spectral density and is a measure of how much each frequency characterizes a periodic function (Priestley 1981, VanderPlas 2018).Knowing the power spectral density of a function is equivalent to knowing the periodic function since any periodic function can be reconstructed as a sum of sines and cosines with the frequencies from its power spectral density (Priestley 1981).The LS power aims to estimate the power spectral density when the data are unevenly sampled in time (i.e. when classical time series methods do not apply).Specifically, the LS power for trial frequency f is , where cos(4πf t) (Lomb 1976, Scargle 1982, VanderPlas 2018).
When the data are generated from equation (1), 2P (f )/σ 2 ∼ χ 2 2 for each f which is established by Cochran's theorem and the fact that t sin[2πf (t−τ )] cos[2πf (t−τ )] = 0, for any f .Further, it can be shown that P (f )/σ 2 ∼ exp(1) where exp(1) is the exponential distribution with scale parameter one (Casella & Berger 2021, Scargle 1982, VanderPlas 2018).Finally, because having data with mean zero and known σ 2 is rare, the LS power is often calculated with data that has been centered around the empirical mean and scaled using the empirical variance.We refer to 2P (f )/σ 2 using these empirical values as the scaled LS power.
One of the main difficulties of period detection using the LS power is determining the significance of the largest peak.This difficulty is because the power at one frequency is correlated with the power at other frequencies through a convolution of the true signal and the observation times, where a convolution is defined as f * g(t) := ∞ −∞ f (τ )g(t−τ )dτ (VanderPlas 2018).In the case of Gaussian errors as in equation ( 1), Baluev (2008) adapted theory (based on work from Davies (1977), Davies (1987), and Davies ( 2002)) for approximating the distribution of the maximum of chi-squared random variables over frequencies f , for the largest LS power.While this approximation gives an analytic result that can be easily implemented, it is an upper bound and can greatly overestimate the type 1 error rate when aliasing occurs (VanderPlas 2018).In Figure 2, we examine the type 1 error rate of the maximum LS power compared to the upper bound derived in Baluev (2008).We follow VanderPlas (2018) and generate 100 unevenly sampled observations m 1 , . . ., m 100 from the set {0, 1, . . ., 5• 24} (i.e. 100 observations over 5 days) and rescale the observation times as t i = 0.01 • m i for i ∈ {1, . . ., 100}.Following VanderPlas (2018), we generate a grid of frequency values from 1/(24 • 25) to 24 • 10.For 10,000 iterations, we randomly generate the flux values from a Gaussian distribution with σ 2 = 1 (i.e.under the null hypothesis) and calculate the LS power at each point in the frequency grid.These datasets are used to calculate the FAP directly.Additionally, we create a bootstrapped version of one of the datasets and calculate the corresponding LS power for the same frequency grid for 10,000 iterations.These bootstrapped datasets are used to calculate the FAP using bootstrapping methods that would be necessary in practice without an analytical bound.Specifically, the direct and bootstrap FAPs are calculated as follows.First, we create a grid over the range of observed maximum LS power values.Then for each grid point, we calculate the percentage of the observed maximum LS power values that are larger than it.Finally, the Baluev bound is calculated using equation (6) of Baluev (2008).

Simulation Study
The purpose of our simulation study is to demonstrate the properties of the classical period detection statistics discussed in this paper predominately under H 0 .How these properties change under heteroscedastic error and a particular case in which H a is true are explored as well but are for demonstration purposes.Exhaustive power analyses can be conducted for detecting a non-constant periodic function in the data using these statistics but would need to be performed based on the signal and noise levels expected for a given application.Thus, further collaborations and research is needed to conduct such power analyses.Each subsection below details how the data were generated and examines the properties of the statistics discussed above under H 0 with homoscedastic error, H 0 with heteroscedastic error, and H a with homoscedastic error.For all scenarios, we followed Mowlavi et al. (2023) and used a frequency range between 0.005 and 15 d −1 (a period range of 1.6 hours to 200 days) with a fixed frequency step of 10 −5 d −1 .We chose this grid because our simulation study under H a is based on one of the models fit to a binary star system found in Gaia data by the same group.

Properties of Statistics and Periodograms under H 0
To demonstrate the properties of the PDM, AOV, SL, and LS power statistics under H 0 with homoscedastic error, we generate 300 simulated datasets based on observed eclipsing binary data from the Gaia telescope.Our generation of simulated data can be decomposed into three main steps.First, we identify Gaia sources that have been labeled as eclipsing binaries by Mowlavi et al. (2023).Second, we determine the average observed flux and average observed flux error for these eclipsing binaries and generate simulated G-magnitudes under H 0 and corresponding errors based on these observed values.Finally, the observation times of the simulated data are generated based on actual observation times from a randomly selected source from Step 1.
We perform Step 1 as follows.First, we keep Gaia observations with Gmagnitudes between 18.3 and 19.5; these apparent magnitudes were chosen because they approximately cover the range of G-magnitudes that are most likely to be observed and are the range of magnitudes that will be simulated under H 1 (Mowlavi et al. 2023).We then merge this data with the data used in Mowlavi et al. (2023) to obtain sources that have been labeled as eclipsing binaries.
Step 2 is needed because the original observed weighted-mean flux values from Gaia have heteroscedastic error and the data under H 0 is assumed to have homoscedastic error.We calculate a reasonable homoscedastic error by determining the average flux error of the observations from Step 1 where we define the average flux error as the product of the average observed weighted-mean flux and the average flux error ratio (observed flux error divided by observed flux).We denote this new average flux error as σ f .The simulated weighted-mean flux values, f , are then generated as random draws from a normal distribution with the mean equal to the average observed weighted-mean flux and standard deviation σ f .The simulated G-magnitudes, G, are then calculated where G 0 = 25.6874(Carrasco et al. 2016, Riello et al. 2021).Additionally, using the Delta Method for the first term in equation ( 9) (Casella & Berger 2021), the standard deviation of G is approximately where σ G 0 = 0.0028 is the zero point uncertainty from Riello et al. (2021).The final simulated G-magnitudes are then set as G/σ G , which we refer to as the scaled Gmagnitudes.In other words, the scaled G-magnitudes generated in this way correspond to the X t in equation ( 1) with a constant periodic signal, i.e. under H 0 , and (approximately) homoscedastic noise.We generate 43 such simulated G-magnitudes because that was the average number of observations a source had in our merged data from Step 1. Finally, the observation times for the simulated data under H 0 are set.To do this, we randomly selected a source from Step 1 with 43 observations.For the evenly-spaced observation scenario, we generated a grid of observation times between the first and last time stamps of the selected source.Otherwise, for the unevenly-spaced observation scenario, we set the observation times as the actual observation times from the source.Figure 3 shows the scaled G-magnitudes with their corresponding observation times used for calculating the classical period-detection statistics discussed in Sections 5 and 6.
Figures 4 and 5 show the phase-folded data with histograms of the corresponding AOV, scaled PDM, SL, and scaled LS power statistics for evenly-and unevenly-spaced observation times, respectively.The solid black line is the density of the central F distribution for the AOV statistics, the density of the central beta distribution for the scaled PDM statistics, and the density of the chi-squared distribution with two degrees of freedom for the scaled LS power.The plotted periods are chosen for comparison to the case where the data contain a non-constant periodic function in Section 7.3.As demonstrated by Figures 4 and 5, the AOV, PDM, and LS power statistics follow their respective theoretical distributions under H 0 as shown in Sections 5 and 6.Not displayed, however, is that the scaled PDM statistic calculation breaks if all of the phase-folded times are in the same bin.This can and does occur when the observation times are evenly-spaced but does not occur in the unevenly-spaced case.When this happened, we set the scaled PDM to be missing.
Figure 6 shows the maximum AOV and scaled LS power and minimum scaled PDM and SL statistics.Clearly, the distributions of the extreme values do not follow the theoretical distributions under H 0 discussed in Sections 5 and 6.As such, the FAP using these statistics would need to be calculated using their respective extreme value distributions, of which only the LS power has a known bound.
Finally, Figure 7 shows the periodograms of the AOV, scaled PDM, SL, and scaled LS power statistics over the range of trial periods.The black line is the mean periodogram at each trial period, and the blue lines are the mean periodogram plus and minus the Monte Carlo standard deviation at each trial period.The periodograms of the AOV, SL, and LS power statistics suggest the data come from a constant function since there is no clearly dominating peak.On the other hand, the periodogram of the scaled PDM has multiple peaks when the data have evenly-spaced observation times that suggest the data could come from a non-constant periodic function.This issue, however, is alleviated when the observation times are unevenly-spaced but is still concerning as additional resources would need to be expended to determine that the result is spurious.

Robustness of Statistics and Periodograms under Heteroscedasticity
Now, we explore how robust the PDM, AOV, SL, and LS power values are under H 0 with heteroscedastic noise.We do this under two scenarios.In the first scenario, we generate the data exactly like how the data was generated in Section 7.1 but we do not rescale G by σ G .In the second scenario, we set the weighted-mean flux error to increase from the minimum observed flux error ratio to the maximum observed flux error ratio of the data in Step 1. Again, we do not rescale G by σ G .The second case aims to represent the degradation or mis-calibration of an instrument over time with the more reaslistic use case of the (unscaled) G-magnitudes, while the first case isolates the effect of not scaling the G-magnitudes.For both cases, the observation times are unevenly-spaced as described above.Figure 8 shows the observed data under these two cases.
Figures 9 and 10 are the same as Figure 5 but for the heteroscedastic data generated for this section.Unsurprisingly, the AOV, scaled PDM, and scaled LS power statistics are clearly robust to the heteroscedasticity from not scaling G by σ G in that they  The first row displays the simulated phase-folded, scaled G-magnitude values generated as independent and identical Gaussian realizations evenly spaced in time.The observed bin means are indicated by the black lines and the observations within each bin are the same shade of grey.The second, third, fourth, and fifth rows display histograms of the AOV, scaled PDM, and SL statistics and scaled LS power, respectively.The solid black line is the density of the central F distribution in the second row, the central beta distribution in the third row, and the chi-squared distribution with two degrees of freedom in the fifth row.As the SL statistic does not have a closed-form density function, none is plotted here.The plotted periods are chosen for comparison to the case where the data contain a non-constant periodic function in Section 7.3.still follow their theoretical distributions under H 0 ; this is because σ G is also random (from the randomness in f ) and does not depend on time.Similarly, the SL statistic appears to be robust to the heteroscedasticity introduced from not re-scaling G as the histogram in Figure 9 closely resembles that from Figure 5. Surprisingly, however, the AOV, scaled PDM, and scaled LS power statistics are fairly robust to increasing noise over time.This is likely due to the phase-folding shuffling the observations so that the observations appear to have homoscedastic noise.Although, for the trial period of 75.019 days the AOV statistic appears to decrease while the PDM statistic appears to increase; this is not terribly concerning, though, as these values are unlikely to result   in a false alarm.However, the SL statistic does not appear to be robust to increasing noise over time as it appears to decrease noticeably from the values observed in Figure 5.This shift is concerning as smaller values of the SL statistic may result in a false alarm.
Finally, Figure 11 is the same as Figure 7 except for the data generated in this subsection.Interestingly, the spurious results in the scaled PDM periodograms discussed in the previous subsection disappear with the addition of heteroscedasticity.All of the periodograms suggest the data come from a constant function.

Power Analysis of Statistics and Periodograms under H a
Finally, we explore how the PDM, AOV, SL, and LS power statistics perform under a particular example in which H a is true, i.e. a case in which the data come from a non-constant periodic function.To generate this data, we chose one of the fitted models from Mowlavi et al. (2023) that had well separated fitted Gaussians which mimicked a well-detached binary system.We chose this type of model as this would be the easiest case to determine the periodic component of the data and thus demonstrate the ability, the power in statistical terms, of these statistics to detect a non-constant period in a realistic data scenario.
We arbitrarily set a period of 150 days for the binary system and set the weightedmean flux as the inverse of equation ( 9) where G is determined from the Mowlavi et al. (2023) model.We then generated 300 datasets by randomly drawing f from a normal distribution with the new weighted-mean flux as the mean and σ f from Section 7.1 as the standard deviation, converting f to G using equation ( 9), and rescaling the simulated G-magnitudes to be G/σ G .Finally, we set the observation times to be unevenly-spaced as described above.Figure 12 shows the underlying periodic function and the simulated values analyzed in this subsection.Figure 13 is the same as Figure 5 but for the data generated as under H a for this section.Clearly, none of the statistics follow their distributions under H 0 as all of the observed values are in the tails of these distributions.Finally, Figure 14 shows the maximum AOV and scaled LS power and minimum scaled PDM and SL statistics for this subsection's data.Comparing with Figure 6, it is clear the maximum observed AOV and minimum scaled PDM and SL statistics using data from a non-constant periodic function differ from those corresponding statistics using data generated from a constant function.Oddly, this does not appear to be the case for the LS power.Further, the  AOV statistic seems to be the most sensitive to detecting a periodic function because the observed maximum values are fairly far away from those observed under H 0 .In fact, the difference between the smallest observed maximum under H a and largest observed maximum under H 0 is about 6.The PDM statistic seems the next most sensitive to detecting a (non-constant) periodic signal because the observed minimums under H a are well-separated from those observed under H 0 ; however, it is possible that some of the values observed under H 0 may be in the tail of the distribution under H a but are unobserved due to only performing 300 simulations.In fact, the difference between the smallest observed minimum under H 0 and the largest observed minimum under H a is about 0.024.Finally, based on the simulations the SL statistic seems the least sensitive to detecting a (non-constant) periodic signal for the same reason as the PDM statistic and that the observed minimums under H a actually appear as observed minimums under H 0 .These simulations agree with the observation made in Feigelson et al. (2021) that while the SL method avoids the arbitrary binning of observations, it is less sensitive to detection of the true period in practice.Finally, Figure 15 shows the periodograms of the AOV, PDM, and SL statistics and the LS power values for the data generated for this subsection.All of the periodograms have a strong peak at a false period of 75 days, which is due to both of the dips in Figure 12 aligning as can be seen in the top-left panel of Figure 13.This is a common issue for periodograms (Hara & Ford 2023).Interestingly, all of the phase-based methods have  a stronger peak at the correct period of 150 days but the LS power periodogram does not.Finally, the AOV statistic appears to have smaller fluctuations at incorrect periods compared to the PDM and SL statistics.

Concluding Remarks
Detecting a periodic signal in data is a historic and on-going area of research for astronomers.Much research has been conducted on and using the PDM (Stellingwerf   1978), AOV (Schwarzenberg-Czerny 1989), and SL (Dworetsky 1983) statistics, and the LS power values (Lomb 1976, Scargle 1982) by astronomers.However, there has been little input from statisticians which has created concerns about the theoretical and empirical properties of these methods (Feigelson et al. 2021).Our paper aims to fill this gap.We confirm that a scaled version of the PDM statistic follows a beta distribution, the AOV statistic follows an F distribution, and the scaled LS power follows a chisquared distribution with two degrees of freedom when the data are perturbations of a constant function.However, the SL statistic does not have a closed-form distribution.We emphasize that calculation of the FAP, or p-value in the statistics literature, can only use these distributions if one is only concerned about the FAP for a particular trial period.However, as demonstrated by Figure 6 the FAP would need to be calculated under a different distribution if an extreme value such as the minimum or maximum is used to test for a non-constant periodic signal.In the case of the SL statistic, a modification of the bootstrap method would need to be employed for any calculation of the FAP, e.g.those suggested by Fukuchi (1994).Finally, as demonstrated by Figure 14, we find the AOV statistic is most sensitive to a non-constant periodic signal while the SL statistic and LS power are the least sensitive.Additionally, the phase-based methods are less sensitive to false periods, particularly the AOV statistic, compared to the LS power.These results align with the conclusions drawn by Heck et al. (1985).
However, our simulation study focuses on the properties of these statistics under the null hypothesis that the data are fluctuations from a constant function and performed little exploration on the power of these statistics to detect a periodic signal in data.Further collaboration and research is needed to perform power analyses (how well these statistics detect periodic signals) that astronomers would find useful, such as those discussed in Heck et al. (1985).Finally, additional research is needed to identify ways to overcome challenges such as faint signals, irregular observing cadences, and interactions between observations and noise (Feigelson et al. 2021, Hara & Ford 2023).

Figure 1 :
Figure 1: This figure demonstrates how phase-folding can be used to construct tests of trial periods

Figure 2 :
Figure2: Comparison of the false alarm probabilities calculated directly and using bootstrap resampling versus the theoretical bound provided byBaluev (2008) for the largest observed LS power.

Figure 3 :
Figure 3: Simulated observed scaled G-magnitude under the null hypothesis that observations are Gaussian noise.Only the time stamp differs in the even-and unevenlyspaced observation cases.Note, the scale of the G-magnitude is different from typically observed values due to dividing by σ G .

Figure 4 :
Figure4: The first row displays the simulated phase-folded, scaled G-magnitude values generated as independent and identical Gaussian realizations evenly spaced in time.The observed bin means are indicated by the black lines and the observations within each bin are the same shade of grey.The second, third, fourth, and fifth rows display histograms of the AOV, scaled PDM, and SL statistics and scaled LS power, respectively.The solid black line is the density of the central F distribution in the second row, the central beta distribution in the third row, and the chi-squared distribution with two degrees of freedom in the fifth row.As the SL statistic does not have a closed-form density function, none is plotted here.The plotted periods are chosen for comparison to the case where the data contain a non-constant periodic function in Section 7.3.

Figure 5 :
Figure 5: Same as Figure 4 except the observed data are unevenly spaced in time.

Figure 6 :
Figure6: Maximum AOV statistic and scaled LS Power and minimum scaled PDM and SL statistics for data generated under H 0 with evenly-(top) and unevenly-(bottom) spaced observations.Note, we have excluded the densities under the null displayed in Figures4 and 5because the empirical densities here are very different from those for a single trial period/frequency.Specifically, note the changes in the axes values.

Figure 7 :
Figure 7: Periodograms of the AOV (left), LS power (middle-left), scaled PDM (middle right), and SL (right) statistics over a range of trial periods for evenly-(top) and unevenly-(bottom) spaced observation times under H 0 with homoscedastic error.The black line is the mean periodogram for 300 simulated data sets, at each trial period, and the blue lines are the mean periodogram plus and minus the Monte Carlo standard deviation, at each trial period.Note, the y-axes are the values of the respective statistics in the titles.

Figure 9 :
Figure 9: Same as Figure 5 but with generated data containing heteroscedastic noise from not rescaling G by σ G .

Figure 10 :
Figure 10: Same as Figure 5 but with generated data containing noise that increases over time.

Figure 11 :Figure 12 :
Figure 11: Same as Figure 7 except top row is from data with heteroscedastic error from not rescaling G by σ G and bottom row from data containing noise that increases over time.

Figure 13 :
Figure 13: Same as Figure 5 but with generated data from a non-constant periodic function with homoscedastic noise and p 0 = 150.

Figure 14 :
Figure 14: Maximum AOV statistic and LS Power and minimum scaled PDM and SL statistics for data generated under H a with evenly-(top) and unevenly-(bottom) spaced observations.Note, we have again excluded the densities under the null.

Figure 15 :
Figure 15: Same as Figure 7 except data is generated from a non-constant periodic function with homoscedastic noise, i.e. under a potential H a , with true period p 0 = 150 (vertical line).