Abstract
This paper presents a new statistical method that enables the inclusion of systematic errors in the maximum-likelihood regression of integer-count Poisson data to a parametric model. The method is primarily aimed at the characterization of the goodness-of-fit statistic in the presence of the over-dispersion that is induced by sources of systematic error, and is based on a quasi-maximum-likelihood method that retains the Poisson distribution of the data. We show that the Poisson deviance, which is the usual goodness-of-fit statistic and is commonly referred to in astronomy as the Cash statistic, can be easily generalized in the presence of systematic errors, under rather general conditions. The method and the associated statistics are first developed theoretically, and then they are tested with the aid of numerical simulations and further illustrated with real-life data from astronomical observations. The statistical methods presented in this paper are intended as a simple general-purpose framework to include additional sources of uncertainty for the analysis of integer-count data in a variety of practical data analysis situations.

Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
This article was updated on April 7, 2025 to correct a production error which led to incorrect text citations for some references.
1. Introduction: Systematic Errors and Poisson Regression
It is common practice, in the physical sciences and related disciplines, to classify uncertainties or "errors" in quantities of interest into two categories: "random" or "statistical" errors on one hand, and "systematic" errors on the other. The two types of error generally reflect the following assumptions: (1) the method of measurement of the random variable yields an inherent sampling distribution of measurements, when the experiment is repeated under the same experimental conditions (e.g., a Normal or Poisson distribution); and (2) there may be unknown sources of uncertainty, either in the measurement process itself or in the underlying theory, which may cause additional errors that do not average down with repeated measurements.
This subject is of practical importance for the measurement of physical quantities (e.g., H. H. Ku 1969; B. N. Taylor & C. E. Kuyatt 1994; ISO Guide 98-3 2008). While "random" errors have been the purview of probability and statistics since its beginnings, i.e., via the characterization of a random variable by its sampling distribution and moments, there seems to be no general consensus as to what is meant by "systematic" errors, and how to provide a probabilistic model for them. For example, comments on this subject by J. Glosup & M. Axelrod (1996), C. Eisenhart (1963), N. E. Dorsey & C. Eisenhart (1953), and H. Jeffreys (1966) suggest the presence of one or more additional random variables besides those of interest to the experimenter. These variables can be included in the statistical modeling but considered uninteresting (i.e., a nuisance parameter such as a background level; e.g., L. Lyons 2008, 2020; G. Cowan 2019), or they may not be accounted for in the experimental design at all (i.e., hidden or latent variables; e.g., A. Skrondal & S. Rabe-Hesketh 2004). In both cases, these variables have the potential to affect the measurement of the variable(s) of interest, both by biasing the estimation of the means of the target variables and by contributing additional variance (and possibly higher-order moments). This interpretation guides the model for systematic errors to be presented in this paper.
A review by D. van Dyk & L. Lyons (2023) describes several methods that are commonly used in the physical sciences to address sources of systematic error. Some of the simplest and more common methods use a one-parameter-at-a-time error propagation, or several parameters simultaneously (see, e.g., J. Heinrich & L. Lyons 2007). These methods are especially common, and have been the de facto standard for astrophysics (see, e.g., notable applications by W. L. Freedman et al. 2001; S. W. Allen et al. 2004). More complex frequentist and Bayesian methods are also available (e.g., R. D. Cousins & V. L. Highland 1992; H. Lee et al. 2011; M. A. Acero et al. 2022), depending on the data and type of application (i.e., point estimation, interval estimation, upper limits, goodness of fit). Certain such methods require subsidiary measurements to constrain the likelihood associated with the nuisance parameters, and to marginalize the posterior for the parameters of interest with respect to the nuisance parameters, e.g., see application to cosmological parameters by Planck Collaboration et al. (2016, 2014), or for the calibration of X-ray instruments by H. Lee et al. (2011) and J. J. Drake et al. (2006).
In regression applications with Normal data, it is customary to treat such systematic errors with the addition of a fiducial systematic variance to that of the data, assuming independence between measurement and systematic errors. This is the usual method to address sources of unknown systematic error in many physics and astronomy regression applications (e.g., P. R. Bevington & D. K. Robinson 2003; J. Nordin et al. 2008), especially when there is no additional information on the physical origin of the systematic error. A method popular among astronomers to include systematic errors in linear regression is the bivariate correlated errors and intrinsic scatter (BCES) method of M. G. Akritas & M. A. Bershady (1996). That method, however, applies only to linear regression, and does not explicitly provide a maximum-likelihood goodness-of-fit statistic.
For Poisson-distributed count data, on the other hand, the fixed relationship between the mean and variance of the distribution complicates the statistical treatment of systematic errors. Overdispersed count-data distributions, such as the negative binomial or the Poisson-inverse Gaussian (P-IG), are a possible solution to the problem (see, e.g., J. M. Hilbe 2011, 2014; C. Cameron & P. Trivedi 2013). However, it is often desirable to maintain the Poisson distribution of the data, both because of its simpler log-likelihood properties and the belief that the data do result from a fixed-rate Poisson process. Motivated by this challenge, a new method to deal with the regression of Poisson data with systematic errors was proposed by M. Bonamente (2023a), which is based on the modeling of systematic errors via an intrinsic model variance, instead of using an overdispersed model for the data themselves. Based on that idea, this paper presents in more detail the statistical framework for the regression of Poisson data in the presence of systematic errors, and certain asymptotic results on the Poisson deviance as its goodness-of-fit distribution.
This paper is structured as follows: Section 2 provides a review of maximum-likelihood regression with Poisson data and what is known concerning its goodness-of-fit statistic. Section 3 presents the details of the model for systematic errors for Poisson data, followed by the new proposed method of regression and its goodness-of-fit statistic in Section 4. The model is then tested in Section 5 with the aid of numerical simulations, Section 6 describes methods of use of this model, and Section 7 presents our conclusions.
2. The Maximum-likelihood Poisson Regression
This Section describes the data model and the goodness-of-fit statistic that results from the maximum-likelihood method of regression with Poisson data. Although other methods of regression are available, such as the least-squares method, the (generalized) method of moments (for a review, see, e.g., C. Cameron & P. Trivedi 2013) or the method of maximum entropy (e.g., E. T. Jaynes 1957), the maximum-likelihood criterion (e.g., R. Fisher 1925) is commonly used in the physical sciences, and it is the method considered in this study.
2.1. The Data Model
Regression of cross-sectional Poisson data via the maximum-likelihood method uses a data model of the type (xi, yi), for i = 1, ..., N independent measurements yi ~ Poisson(μi) at different values of independent variables xi. The maximum-likelihood method returns the Poisson log-likelihood as the fit statistic,

ln L(θ) = ∑i=1..N ( yi ln μi − μi − ln yi! ),    (1)

where θ = (θ1, ..., θm) are m adjustable parameters of a model y = f(x; θ) that can be nonlinear, μi = f(xi; θ) are the means of the Poisson distributions, and μ̂i are those means evaluated at the best-fit parameter values θ̂, i.e., μ̂i = f(xi; θ̂). For a textbook review of Poisson maximum-likelihood regression, see, e.g., C. Cameron & P. Trivedi (2013), M. James (2006), and M. Bonamente (2022).
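As a concrete illustration, the Poisson log-likelihood of Equation (1) is straightforward to evaluate numerically. The sketch below is an illustrative Python implementation (not part of the paper), using `scipy.special.gammaln` for the ln yi! term:

```python
import numpy as np
from scipy.special import gammaln

def poisson_loglike(y, mu):
    """Poisson log-likelihood ln L = sum_i (y_i ln mu_i - mu_i - ln y_i!),
    with the factorial term computed as gammaln(y_i + 1)."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    return float(np.sum(y * np.log(mu) - mu - gammaln(y + 1)))
```

By construction this agrees with the sum of `scipy.stats.poisson.logpmf` over the data bins.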
The x variable, with its xi fixed positions, represents the predictor or independent variable, and the response or dependent variable yi represents an integer number of occurrences detected as a function of the predictor. This data model applies to a variety of data in the physical sciences and related disciplines, and it is also commonly used in other fields such as econometrics (C. Cameron & P. Trivedi 2013), where the most common model used is the simple linear model with m = 2 parameters. In this paper, the term "model" is used to indicate interchangeably what astronomers might refer to as either a physical model (e.g., the type of radiation emitted by an astronomical source), or a detector model that characterizes the method of detection and how counts are collected by the instrument. The term is therefore used in this paper in a general sense to indicate a functional form for the parent mean μi for the Poisson distribution of detected counts, according to Equation (1).
In certain disciplines, it is common to use several predictors for the response, e.g., x = (x1, ..., xk) with k ≥ 1, typically using a linear model (i.e., the multiple linear regression, e.g., C. R. Rao 1973; C. Cameron & P. Trivedi 2013) or generalized linear models (GLMs; e.g., P. McCullagh & J. Nelder 1989). In applications for astronomy and the physical sciences, however, emphasis is placed on a physically motivated and often nonlinear model f(x) that is a function of just one predictor, k = 1. In astronomy, a scalar x often represents either time or the energy of a particle or a photon, and the associated data are referred to, respectively, as time series (or light curves) and spectra (see, e.g., E. D. Feigelson & G. J. Babu 2012). The class of GLMs with intercept, μi = η(Xi θT), with Xi the design vector including an intercept term and η(·) the link function (e.g., P. McCullagh & J. Nelder 1989), is included in the general class of functions f(x, θ) considered in this paper. The results to be presented in this paper apply to predictors with any dimension, k ≥ 1, although examples and simulations will only be reported for scalar predictors with k = 1.
2.2. The Poisson Deviance
The Poisson deviance is defined as twice the difference between the maximum achievable log-likelihood and that of the fitted model μ̂i,

DP = 2 [ ln L(y; y) − ln L(y; μ̂) ] = 2 ∑i=1..N [ yi ln(yi/μ̂i) − (yi − μ̂i) ],    (2)

and it is the goodness-of-fit statistic of choice for the Poisson regression, where μ̂i = f(xi; θ̂) are the best-fit Poisson means obtained for the best-fit parameters θ̂ (e.g., C. Cameron & P. Trivedi 2013). The Poisson deviance differs from Equation (1) via the subtraction of ln L(y; y), which is the log-likelihood for the saturated model (i.e., the model in which the N means are set equal to the measured data), and which is not a function of the model parameters. The Poisson deviance statistic is also referred to as the G-statistic (Y. Bishop et al. 1975).4
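In code, the deviance of Equation (2) can be sketched as follows (an illustrative Python implementation, not from the paper); the convention 0 ln 0 = 0 handles empty bins:

```python
import numpy as np

def poisson_deviance(y, mu_hat):
    """Poisson deviance D_P = 2 sum_i [y_i ln(y_i/mu_i) - (y_i - mu_i)].
    Bins with y_i = 0 contribute only 2*mu_i (the 0 ln 0 term vanishes)."""
    y = np.asarray(y, dtype=float)
    mu_hat = np.asarray(mu_hat, dtype=float)
    log_term = np.zeros_like(y)
    pos = y > 0
    log_term[pos] = y[pos] * np.log(y[pos] / mu_hat[pos])
    return 2.0 * float(np.sum(log_term - (y - mu_hat)))
```

As expected from the definition, the deviance vanishes when the fitted means equal the data (the saturated model).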
This statistic has received much attention for the goodness-of-fit assessment of high-energy astronomy count data. In this field, the DP statistic is known as the Cash or C statistic, and it is commonly referred to as C, after the pioneering work of X-ray astronomer W. Cash and others (W. Cash 1976, 1979; S. Baker & R. D. Cousins 1984). The Cash statistic is also the fit statistic of choice for the major high-energy astronomy spectral packages (K. A. Arnaud 1996; J. S. Kaastra et al. 1996; S. Doe et al. 2007).5 According to Equation (2), we define the function

C ≡ DP(y; μ̂) = 2 ∑i=1..N [ yi ln(yi/μ̂i) − (yi − μ̂i) ],    (3)

where DP(y; μ̂) is the usual Poisson deviance evaluated at the ML estimates μ̂i of the Poisson means. The definition of Equation (3) implies that yi ~ Poisson(μi) be independent measurements, as per the data model of Section 2.1.
2.3. Distribution of the Poisson Deviance
According to Equation (2), the Poisson deviance DP is a likelihood-ratio statistic, as already pointed out by W. Cash (1979). A likelihood-ratio statistic, as originally proposed by J. Neyman & E. S. Pearson (1928), is generally defined by the ratio of two likelihoods,

Λ = max_{θ ∈ Θ0} L(θ) / max_{θ ∈ Θ} L(θ),    (4)

where Θ is the full parameter space, and Θ0 ⊂ Θ is a subset of parameter space representing the null hypothesis. A general result for the asymptotic distribution of certain likelihood-ratio statistics is provided by the Wilks theorem (S. S. Wilks 1938; S. Wilks 1962).6 For a simple null hypothesis in which all of the m parameters are fixed, the Wilks theorem may be stated as

−2 ln Λ = 2 [ ln L(θ̂) − ln L(θ0) ] ~ χ2(m)    (5)

in the asymptotic limit of a large sample. In Equation (5), ln L(θ0) is the log-likelihood for a simple null hypothesis H0 represented by a set of fixed (or true) parameters θ = θ0, with θ0 ∈ Θ0; and ln L(θ̂) is the usual log-likelihood obtained by using the m best-fit maximum-likelihood parameters that were left free in the fit.
The Wilks theorem also applies to a composite hypothesis with s < m of the parameters fixed at a reference (or true) value, and the remaining m−s parameters being free to adjust to their best-fit values. In that case, the Wilks theorem results in

−2 ln Λ ~ χ2(s).    (6)
One of the key hypotheses for the applicability of the Wilks theorem is that partial derivatives of the Poisson log-likelihood with respect to θi are continuous (see, e.g., C. R. Rao 1973, Section 5f).
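The Wilks result of Equation (5) is easy to verify numerically. The sketch below (illustrative Python, not from the paper) fits the one-parameter constant model to Poisson data, for which the ML estimate is the sample mean, and checks that the average of the likelihood-ratio statistic 2[ln L(μ̂) − ln L(μ0)] is close to 1, the mean of χ2(1):

```python
import numpy as np

def wilks_statistic(y, mu0):
    """2 [ln L(mu_hat) - ln L(mu0)] for the constant model, whose ML
    estimate is the sample mean; the ln y! terms cancel in the difference."""
    mu_hat = np.mean(y)
    return 2.0 * (np.sum(y) * np.log(mu_hat / mu0) - y.size * (mu_hat - mu0))

rng = np.random.default_rng(12345)
mu0, n_bins, n_trials = 50.0, 50, 2000
stats = [wilks_statistic(rng.poisson(mu0, n_bins), mu0) for _ in range(n_trials)]
mean_stat = float(np.mean(stats))  # should be close to 1, the mean of chi2(1)
```

The simple hypothesis here has m = 1 fixed parameter, hence one degree of freedom for the asymptotic χ2 distribution.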
Within the context of likelihood ratios for DP, the term ln L(y; y) used in Equation (2) represents the log-likelihood for the saturated or perfect model, as described earlier. Unfortunately, the hypotheses of Wilks' theorem are not satisfied when the logarithm of the denominator of Equation (4) is the saturated log-likelihood ln L(y; y), as is the case for the Poisson deviance defined by Equation (2). This is because the perfect or saturated model, in which the means are set equal to the data, is not described by the same continuous parameterization θ as the null model (term ln L(y; μ̂)). Therefore, the Wilks theorem does not apply in general to the Poisson deviance DP, and thus it is not possible to assume that DP ~ χ2(N − m) in full generality, for all values of the parent mean and for any form of the null-hypothesis parametric model.
Asymptotic and approximate distributions for are available in the literature. P. McCullagh (1986) found the conditional distribution of DP for log-linear models in the extensive and sparse regime (i.e., large N and small μ) via the method of cumulants (P. McCullagh 1984), including first- and second-order corrections to the asymptotic χ2 distribution that applies to the large-mean regime. Similar results were also obtained by J. S. Kaastra (2017) and M. Bonamente (2020), however limited to the case of a simple hypothesis. An in-depth analysis of the Poisson deviance, including more accurate approximations that generalize the P. McCullagh (1986) results to a broader class of models, is presented in Y. Chen et al. (2025, in preparation). A useful approximation to the distribution of the Poisson deviance is available in the large-mean regime, where DP is approximately distributed as χ2(N − m), due to the approximation of the Poisson distribution to a Normal distribution in that regime (e.g., in the Gaussian limit of W. Cash 1979). In astrophysics literature, the large-mean regime is usually interpreted as approximately ≥10 or 20 counts per bin (e.g., see discussion in M. Bonamente 2022).
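The large-mean χ2 approximation can likewise be checked by simulation. This illustrative sketch (not from the paper) fits the one-parameter constant model (m = 1) to data with parent mean μ = 100, well within the large-mean regime, and compares the average deviance with the χ2(N − m) mean of N − 1:

```python
import numpy as np

def constant_model_deviance(y):
    """D_P for the best-fit constant model, mu_hat = sample mean."""
    mu_hat = np.mean(y)
    pos = y > 0
    log_term = y[pos] * np.log(y[pos] / mu_hat)
    return 2.0 * (np.sum(log_term) - np.sum(y - mu_hat))

rng = np.random.default_rng(0)
mu, n_bins, n_trials = 100.0, 100, 500
devs = [constant_model_deviance(rng.poisson(mu, n_bins)) for _ in range(n_trials)]
mean_dev = float(np.mean(devs))  # close to N - m = 99 in the large-mean regime
```

At smaller parent means the simulated mean deviance drifts away from N − m, consistent with the corrections discussed above.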
These complications are in contrast with the case of Normal data, where the maximum-likelihood goodness-of-fit statistic

S = ∑i=1..N (yi − μ̂i)² / σi²,    (7)

with σi² the known variance of measurement yi, is distributed as S ~ χ2(N − m) for a variety of models and situations, under rather general conditions (e.g., Cramér's theorem; H. Cramer 1946; P. Greenwood & M. Nikulin 1996). For this reason, this statistic is usually referred to as the minimum chi-squared statistic (χ²min) in astronomical applications. This simple result for Normal data is one of the main reasons why some practitioners use the minimum chi-squared statistic for parameter estimation and hypothesis testing of Poisson data, even though such a method leads to biased results (e.g., P. J. Humphrey et al. 2009).
3. A Model for the Regression of Poisson Data with Systematic Errors
This Section provides the details of the proposed method of analysis, starting with a definition of the type of systematic errors that are being considered.
3.1. Systematic Errors as Source of Overdispersion
In the development of a statistical model of systematic errors, we take the narrow view that systematics only provide an additional source of variance beyond that implied by the distribution of the data, without affecting the mean. While this assumption may seem restrictive, consider that any systematic shift or bias in the measurement of yi can in principle be accounted for by means of suitable model components that contribute to the model function f(xi; θ), provided they are a reasonably smooth function of the regressor variable x.
Practically, this view of systematic errors means that if the response variable yi is affected by systematic errors, the data analyst has corrected for the mean effect resulting from those errors. This is in fact what is routinely done in certain physics and astronomy applications, where such sources of systematics as instrumental gain or calibration are corrected with fixed factors that aim to provide a mean adjustment to the model (see, e.g., the calibration efforts for X-ray astronomy by M. Tsujimoto et al. 2011; P. P. Plucinsky et al. 2017). From an experimental viewpoint, this process results in a best-fit model that generally follows the data without systematic departures (in the mean), yet with a larger-than-expected variance.
In this work, we consider strictly independent measurements yi, as already emphasized in Section 2.1; this is a key assumption of the method. In certain applications, data may be binned with a finer resolution than the effective detector resolution, thereby introducing a certain degree of correlation among the detected counts in adjacent bins, and therefore also among the systematics. The choice of bin size is primarily a data reduction problem that is not directly addressed by this work (see, e.g., J. S. Kaastra & J. A. M. Bleeker 2016), and therefore it is assumed that bin-by-bin independence is satisfied. Nonetheless, it is important to point out that calibration, and therefore correction-in-the-mean, are often carried out over a range of the independent variable that covers several bins simultaneously (e.g., J. J. Drake et al. 2006; H. Lee et al. 2011, for X-ray astronomy examples). This is likely to introduce a degree of correlation in the systematics between adjacent bins that is not accounted for in this model of systematic errors, which would require additional considerations.
A way of specifying the effect of systematic errors on the response yi is illustrated via the compounding representation of random variables,7 with a nonnegative distribution of choice (such as the gamma) for the Poisson mean μi, leading to overdispersion. No restriction is placed on the origin of such overdispersion, i.e., on the shape of the compounding distribution for μi. In fact, we intend to develop a general-purpose model that enables (a) an estimate of the systematic variance, when the use of an a priori value is impractical; (b) a method of regression for Poisson count data in the presence of these errors; and, critically, (c) the evaluation of the distribution for the revised fit statistic (i.e., a deviance DP in the presence of systematics) for the purpose of hypothesis testing. These are the main goals of the method to be presented in this paper.
3.2. Review of Models for Overdispersion in Count Data
Some of the more common distributions to model overdispersed integer data are the negative binomial (NB) distribution, which results from the compounding of a Poisson distribution with a gamma-distributed rate parameter (e.g., M. Greenwood & G. Y. Yule 1920), or the Poisson-inverse-Gaussian (P-IG) distribution (e.g., M. C. K. Tweedie 1957; H. Sichel 1971; A. C. Atkinson & L. Yeh 1982). Such compounding (or mixing) models have been used in a variety of applications across the sciences, see, e.g., D. Karlis & E. Xekalaki (2005), L. Hurtado-Gil et al. (2017), or G. Willmot (1986) for a review of Poisson mixing distributions and applications. In addition to the gamma and inverse Gaussian distributions, some of the other notable mixing distributions include the lognormal distribution (M. G. Bulmer 1974) and the truncated Normal distribution (G. Patil 1964). The NB distribution is commonly used in GLM models (e.g., the NB1 and NB2 parameterizations; P. McCullagh & J. Nelder 1989; C. Cameron & P. Trivedi 2013).
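The gamma-Poisson compounding that generates the NB distribution is easy to demonstrate with a short simulation (an illustrative sketch, not from the paper): a gamma-distributed rate with shape k and scale θ yields counts with mean kθ and variance kθ(1 + θ), i.e., overdispersed relative to a pure Poisson:

```python
import numpy as np

rng = np.random.default_rng(1)
k, theta = 10.0, 0.5           # gamma shape and scale for the rate parameter
n_draws = 200_000

# Compound: draw a gamma-distributed rate, then a Poisson count at that rate.
rates = rng.gamma(shape=k, scale=theta, size=n_draws)
counts = rng.poisson(rates)

mean_nb = counts.mean()        # expected k*theta = 5
var_nb = counts.var()          # expected k*theta*(1 + theta) = 7.5, i.e., > mean
```

The variance-to-mean ratio 1 + θ > 1 quantifies the overdispersion induced by the mixing.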
A method to account for the additional variance induced by systematic errors is therefore the direct use of one such compounded Poisson distribution for the data, in place of the default Poisson model of Section 2.1, or any other distribution that allows overdispersion, such as the generalized Poisson distribution (e.g., P. C. Consul & F. Famoye 1992). There are two major reasons, however, to seek an alternate route for certain applications. First, there is substantial added computational cost in the use of alternative distributions such as the NB or the P-IG. For example, the P-IG distribution makes use of Bessel functions of the second kind (e.g., M. Holla 1967; A. C. Atkinson & L. Yeh 1982); likewise, the likelihood associated with the NB is more complex than for the Poisson distribution (see, e.g., J. M. Hilbe 2011). More critical, however, is the fact that the maximum-likelihood goodness-of-fit statistic becomes practically intractable, even for the relatively less complex NB model (see, e.g., C. Cameron & P. Trivedi 2013). The situation would therefore become even more dire than for the relatively simple Poisson regression statistic DP, which itself lacks the simple distributional properties of the chi-squared statistic for Normal data (see Section 2.3).
On the other hand, and moving away for a moment from "classical" compounded distributions, one could follow purely Bayesian methods for the inclusion of systematic errors in the analysis (see, e.g., J. M. Hilbe et al. 2017). Such methods would typically start with the choice of a prior distribution p(μi) to evaluate the posterior distributions for μi. In this case, assessment of the goodness-of-fit would be addressed via Bayes factors or by means of information criteria such as the Akaike information criterion (H. Akaike 1974), Bayesian information criterion (G. Schwarz 1978), or similar (for a review, see, e.g., P. Stoica & Y. Selen 2004; A. Gelman et al. 2013). Numerical evaluations are typically required for the posterior distribution, of a complexity that is probably of similar order to that of the numerical minimization of the DP statistic for the classical approach, depending on a number of factors such as the complexity of the model and that of the prior distribution of choice. The main drawback, however, is that the Bayesian approach does not have a simple goodness-of-fit statistic to directly test the null hypothesis H0 of a true model.
Certain applications may not require an accurate determination of the distribution of the goodness-of-fit statistic. In those cases, the NB or P-IG models or other compounded-variable variations, or alternatively Bayesian methods, may be the most appropriate route to model overdispersion. However, most physics and astronomy applications do require it, in that it is usually critical to perform an accurate test of a complex hypothesis as a means of validation or rejection of an underlying multiparameter theory. Examples in X-ray astronomy include most cosmological studies, such as the measurement of the Hubble constant (e.g., M. Bonamente et al. 2006; J. T. Wan et al. 2021), of the density of dark matter and dark energy (e.g., S. W. Allen et al. 2004; A. Vikhlinin et al. 2009a, 2009b; A. B. Mantz et al. 2014, 2022), of the cosmological density of baryons and the associated missing baryons problem (e.g., F. Nicastro et al. 2018; O. E. Kovács et al. 2019; D. Spence et al. 2023). This makes the classical compounded distribution method, or fully Bayesian methods, less desirable for this class of applications. The model proposed in the following aims to overcome this difficulty, by developing an approximate method of modeling systematic errors that is both computationally tractable even with limited resources (both computational and in terms of overall model simplicity), and more importantly leading to a simple and accurate method of hypothesis testing.
3.3. Proposed Model: The Intrinsic Model Variance and the Associated Mi Random Variable
We are now in a position to introduce our model for systematic errors in the regression of Poisson data. It is proposed that, after the usual Poisson regression is performed, the best-fit model be considered as a random variable, indicated as Mi, as a means to model the presence of systematic errors. The mean of this random variable is naturally set to the ML estimate μ̂i itself, and its variance is set to a newly introduced intrinsic model variance parameter σ²int,i, as proposed in M. Bonamente (2023a). This idea is illustrated in Figure 1 for a simple linear model, where the intrinsic variability of the model is represented by the blue curve, i.e., an instance of the post-fit randomization of the best-fit linear model according to the random variable Mi.
Figure 1. Illustration of the method of regression with an intrinsic model variance. Data with N = 100 and a constant parent mean μ = 100 (dots) were fit to a linear model (solid line). The blue curve represents an instance of the randomization of the best-fit model according to a Normally distributed Mi with a fixed relative systematic error fi.
Formally, the random variable Mi describes the distribution of the best-fit model μ̂i, i.e.,

E(Mi) = μ̂i  and  Var(Mi) = σ²int,i.    (8)
Any family of distribution functions with nonnegative support, such as the gamma distribution, can be used, and the choice of distribution for Mi is part of the process of modeling systematic errors. The choice of distribution for Mi is discussed in Section 4.5, and it will be shown not to be a critical task for most applications. Moreover, other choices could be made regarding the relationship between the parameters of the distribution of the Mi variable and the parameters μ̂i and σint,i. For example, the variance in Equation (8) could be further generalized to a function of the intrinsic model variance parameter, i.e., Var(Mi) = h(σ²int,i), where h(·) is a positive-valued function. It is believed, however, that the present parameterization is probably sufficiently general for most applications.
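As an illustration of one such choice, the sketch below (illustrative Python, not from the paper) draws Mi from a gamma distribution with the two moments required by Equation (8), i.e., mean μ̂i and variance σ²int,i, via the usual shape/scale reparameterization:

```python
import numpy as np

def draw_model(mu_hat, sigma_int, rng, size=None):
    """Draw M_i ~ Gamma with E(M_i) = mu_hat and Var(M_i) = sigma_int**2.
    A gamma with shape a and scale s has mean a*s and variance a*s**2,
    so a = (mu_hat/sigma_int)**2 and s = sigma_int**2/mu_hat."""
    shape = (mu_hat / sigma_int) ** 2
    scale = sigma_int ** 2 / mu_hat
    return rng.gamma(shape, scale, size=size)

rng = np.random.default_rng(2)
draws = draw_model(100.0, 5.0, rng, size=100_000)   # e.g., f_i = 5% systematic error
```

Any other nonnegative family (e.g., a truncated Normal) could be substituted here, provided its first two moments match Equation (8).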
The intrinsic model variance σ²int,i is assumed to be known a priori, e.g., from even limited knowledge of the instruments or methods used for the detection. For example, counts measured from a detector of the type used in high-energy astrophysics (e.g., H. Lee et al. 2011) are proportional to the effective area of the instrument, and an estimated calibration uncertainty on the effective area may translate linearly to an estimate of the intrinsic model error σint,i. Alternatively, σ²int,i can be estimated from the data, as will be shown below.
The intrinsic model scatter can be recast as a fraction fi of the best-fit model, where

fi = σint,i / μ̂i    (9)

is the fractional amount of systematic scatter relative to the best-fit model. This model of systematic errors only applies to small relative systematic errors, fi ≪ 1, which is the common regime for a variety of applications, such as the astrophysics data that motivated this model (D. Spence et al. 2023). In fact, large disagreements between data and model are more likely to be imputed to a genuine failure of the model, rather than to the presence of systematic errors.
The introduction of the random variables Mi according to Equation (8), and the associated intrinsic model variance σ²int,i, are the main ingredients of this model for systematic errors in Poisson data. This model could also be applied to other data, e.g., Normally distributed measurements, since the definition in Equation (8) makes no assumptions on the distribution of the data themselves. Equation (8) embodies the change of perspective in the role played by systematic errors, i.e., the systematic variance is associated with the model after the regression is performed, rather than with the data themselves. The random variable Mi plays a role that is somewhat similar to that of a Bayesian prior on the mean (as discussed in Section 3.2), in that it provides a probabilistic model that quantifies a prior belief on the intrinsic degree of variability of the parent model. The main difference, however, is that it is included in the probability model only after the regression is performed, and it does not have an effect on it. Therefore, the random variable Mi is used primarily for its effect on the distribution of the null-hypothesis goodness-of-fit statistic, as will be explained in detail in Section 4.
4. The Goodness-of-fit Statistic for the Poisson Regression with Systematic Errors
The introduction of the Mi random variable to model systematic errors is now incorporated into a statistical model for the regression of Poisson data. The main aim of this model is to derive a goodness-of-fit likelihood-ratio statistic that generalizes the usual deviance DP when systematic errors are present.
4.1. The Quasi-maximum-likelihood Estimation
The choice of introducing an intrinsic model variance and the associated random variables Mi to describe the model after the ML regression is performed (see Section 3.3) is born of convenience. With that assumption, the data yi can in fact be viewed as being distributed according to the compounded distribution

yi | Mi ~ Poisson(Mi),    (10)

with Mi given by Equation (8), in place of the usual yi ~ Poisson(μi). Specifically, the moment requirements for Mi imply that the expectation remains E(yi) = μi under H0, while Var(yi) > μi according to the usual compounding formulas; by the law of total variance, Var(yi) = E[Var(yi | Mi)] + Var[E(yi | Mi)] = μi + σ²int,i. This means that the data can be viewed as retaining the same parent means, but with larger variance (i.e., overdispersed), as was the intent of the method (see Section 3.1).
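The compounding formulas quoted above can be verified directly by simulation (an illustrative sketch, not from the paper), here using a gamma distribution for Mi with mean μ = 50 and variance σ²int = 25, so that Var(yi) = μ + σ²int = 75:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma_int = 50.0, 5.0
n_draws = 200_000

# M_i with E(M) = mu and Var(M) = sigma_int**2, via a gamma distribution.
m = rng.gamma((mu / sigma_int) ** 2, sigma_int ** 2 / mu, size=n_draws)
y = rng.poisson(m)   # y | M ~ Poisson(M), as in Equation (10)

mean_y = y.mean()    # expected E(y) = mu = 50
var_y = y.var()      # expected Var(y) = mu + sigma_int**2 = 75 > mu
```

The parent mean is unchanged by the compounding while the variance grows by σ²int, which is precisely the overdispersion the model is designed to capture.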
With this interpretation, which will be further justified in Section 4.2 below, the usual Poisson maximum-likelihood method of estimation that we use for the regression becomes a quasi-maximum-likelihood estimation (e.g., C. Cameron & P. Trivedi 2013), which is guaranteed to continue providing consistent estimates assuming that the means are correct according to the null hypothesis (as shown in, e.g., C. Gourieroux et al. 1984a). In other words, the regression can be performed by minimizing the usual DP statistic according to Equation (2), and therefore, there is no additional burden to the task of identifying the best-fit model in the presence of systematic errors.
Minimization of the DP statistic must be achieved with numerical methods, even for the simple linear model (e.g., C. Cameron & P. Trivedi 2013; M. Bonamente 2022). In fact, there is no fully analytical solution to the maximum-likelihood equations beyond the simple one-parameter constant model, for which the parameter is trivially estimated by the sample mean. A semianalytical solution to the maximum-likelihood fit for the two-parameter linear model for Poisson data was provided in M. Bonamente & D. Spence (2022), including analytical estimates of the covariance matrix (M. Bonamente 2023b), which can reduce the computational burden for the linear model. For more complex models, the Poisson log-likelihood must be maximized via numerical methods.
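As an illustration of such a numerical minimization, the following sketch (illustrative, not part of the paper's software) fits the two-parameter linear model μi = a + b·xi by minimizing DP with `scipy.optimize.minimize`, starting from an ordinary least-squares guess:

```python
import numpy as np
from scipy.optimize import minimize

def fit_linear_poisson(x, y):
    """Quasi-ML fit of mu_i = a + b*x_i by numerical minimization of D_P."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    def deviance(theta):
        mu = theta[0] + theta[1] * x
        if np.any(mu <= 0):
            return np.inf          # Poisson means must be positive
        log_term = np.where(y > 0, y * np.log(np.where(y > 0, y, 1.0) / mu), 0.0)
        return 2.0 * np.sum(log_term - (y - mu))

    b0, a0 = np.polyfit(x, y, 1)   # least-squares starting point
    res = minimize(deviance, x0=[a0, b0], method="Nelder-Mead")
    return res.x, res.fun

rng = np.random.default_rng(4)
x = np.linspace(0.0, 50.0, 51)
y = rng.poisson(10.0 + 2.0 * x)    # simulated parent model: a = 10, b = 2
theta_hat, dev_min = fit_linear_poisson(x, y)
```

The recovered parameters are close to the simulated parent values, and the minimized deviance is of order N − m, as expected in this large-mean regime.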
4.2. The Goodness-of-fit Statistic
We now turn to the goodness-of-fit statistic in the presence of systematic errors, which we refer to as DP,sys. Given that the Mi variables replace the simple estimates μ̂i in this model of systematic errors, this new goodness-of-fit statistic follows formally from Equation (3), i.e.,

DP,sys = 2 ∑i=1..N [ yi ln(yi/Mi) − (yi − Mi) ].    (11)

This statistic is therefore also immediately obtained as a likelihood-ratio statistic from the joint log-likelihood of the data distributed as in Equation (10), thus supporting our interpretation of Section 4.1.
It is also immediate to see that Equation (11) can be written as the sum of two terms. Referring to the statistic as Z for convenience, we can set

Z = X + Y,

the first term of which, X, represents the usual DP statistic according to Equation (2). The additional statistic Y is associated with the random variable Mi, and it is defined by

Y = Z − X = 2 Σi [yi ln(μ̂i/Mi) + (Mi − μ̂i)].    (13)
It is clear that, when there are no systematic errors (σint,i = 0), Y vanishes identically. Moreover, since Mi is independent of the data yi, Y is asymptotically Normal in the extensive data limit (i.e., large N) by the central limit theorem. Accordingly, we can set the asymptotic distribution of Y as

Y ∼ N(μC, σC²),    (14)

where the two parameters of the distribution represent the additional mean and variance that are induced by the presence of systematic errors.
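The decomposition of the statistic into the two contributions can be illustrated with a short simulation sketch (our own code, assuming a Normal choice for Mi with the moments of Equation (8) and a constant model):

```python
import numpy as np

rng = np.random.default_rng(2)

def deviance(y, m):
    """Poisson deviance, with the y ln y term set to 0 when y = 0."""
    log_term = y * np.log(np.where(y > 0, y, 1.0) / m)
    return 2.0 * np.sum(log_term - (y - m))

N, mu, f = 100, 100.0, 0.05
y = rng.poisson(mu, size=N).astype(float)
mu_hat = np.full(N, y.mean())           # ML fit of the constant model
X = deviance(y, mu_hat)                 # usual D_P (Cash) statistic
M = rng.normal(mu_hat, f * mu_hat)      # randomized means, moments as in Eq. (8)
Z = deviance(y, M)                      # statistic in the presence of systematics
Y = Z - X                               # additional contribution from the systematics
```

A single realization of Y fluctuates around the bias induced by the systematic errors; its distribution is studied in Sections 4.3 and 5.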
This model of systematic errors has therefore achieved our primary goal to provide a simple generalization of the goodness-of-fit statistic for the Poisson regression in the presence of systematic errors. The first contribution to the new goodness-of-fit statistic is thus the usual Poisson log-likelihood X = DP, whose distribution is assumed to be independent of the shape of the model μi = f(xi; θ). Therefore, the exact parameterization does not play a significant role in the determination of the overall distribution for Z, provided that basic requirements on model identifiability are satisfied (e.g., B. Hoadley 1971; J. R. Dale 1986; Y. Chen et al. 2025, in preparation).
The additional contribution Y has a simple Normal asymptotic distribution and, moreover, it is independent of X under the null hypothesis. This is the case for the following reasons. First, as we just remarked, the null-hypothesis distribution of X does not depend on the systematic errors. Second, Y is by construction not dependent on the data yi ~ Poisson(μi), since the Mi random variable defined in Equation (8) depends only on the best-fit means and on an intrinsic model variance that is assumed to be known a priori.
4.3. Estimation of the Parameters of Y
In order to estimate the asymptotic distribution of the Poisson goodness-of-fit statistic in the presence of systematic errors, as given in Equation (11), it is therefore necessary to simply estimate the mean and variance of Y in Equation (14). Detailed calculations to accomplish this task are provided in Appendix B, and key results are summarized below.
The moments of the Y distribution can be obtained via a Taylor series expansion of the logarithmic terms in Equation (13),
by retaining second-order terms in the expansion. Given that E(Mi) = μ̂i according to Equation (8), the expectation of the terms in the first sum of Equation (15) is null, leaving a second-order correction of
where fi = σint,i/μ̂i, with the moments of Mi given by Equation (8). The ratio fi is the fractional amount of scatter, and it is intended to be a number that is much smaller than one. This expectation will be referred to as the bias parameter μC associated with the Z statistic. This expectation can therefore be estimated from the data via

μ̂C = Σi fi² μ̂i.
The variance of Y is referred to as the overdispersion parameter σC², and it evaluates to
where kurt(Mi) is the kurtosis (or normalized fourth-order central moment) of Mi, e.g., kurt(Mi) = 3 for a Normal distribution. The second and third terms on the right-hand side are the second-order correction to the approximate results already provided in M. Bonamente (2023a). This approximation applies again for small relative errors, fi ≪ 1. This variance can be estimated from the data according to
see Appendix B for details.
Equations (17) and (19) are the main results for the estimation of the Y distribution, and they relate the key parameter of the model for systematic errors (i.e., the intrinsic model variance) to the bias and overdispersion parameters for the statistic in the presence of systematic errors. In principle, one may retain higher-order terms in the power series expansions. The accuracy of these approximations is tested in Section 5, where we show that, for typical values of systematic uncertainties, the second-order corrections presented in this paper are sufficient for most applications.
4.4. Asymptotic Distributions
In general, a simple closed form for the asymptotic distribution of Z cannot be given in all regimes, as discussed in Y. Chen et al. (2025, in preparation). However, useful approximations can be given that cover most practical data analysis cases.
In the large-mean regime, X ≈ χ2(ν) due to the approximation Poisson(μ) ≈ N(μ, μ) for the distribution of the data (this is true even for data sets with small N). Provided that Y is approximately Normally distributed for that value of N, it follows that Z = X + Y in the large-mean regime is approximately distributed like the sum of a chi-squared and an independent Normal variable. The resulting distribution is referred to as the overdispersed chi-squared distribution B,
which is a special case of the gamma-Normal family of distributions, with a probability distribution function that is the convolution of independent χ2(ν) and N(μC, σC²) distributions (see M. Bonamente & D. Zimmerman 2024, for properties of this distribution).
For applications with extensive data and in the large-mean regime, it is possible to further approximate χ2(ν) ≈ N(ν, 2ν), therefore bypassing the use of the overdispersed chi-squared distribution. Therefore, Z is asymptotically Normal with mean and variance equal to the sum of the means and variances of the two contributing statistics,

Z ∼ N(ν + μC, 2ν + σC²).    (21)
Therefore, for most applications in the large-mean and extensive data regime, this model of systematic errors for Poisson data results in a straightforward extension of the χ2(ν) distribution for the fit statistic in that regime, with the simple use of Equation (21) as the null-hypothesis distribution for the goodness-of-fit statistic.
In the low-mean regime, however, additional considerations are needed even in the asymptotic limit of extensive data. Y. Chen et al. (2025, in preparation) showed that the Poisson deviance X = DP remains asymptotically Normal, but with mean and variance that will differ significantly from the simple large-mean values assumed in Equation (21). For extensive data in the low-mean regime,

Z ∼ N(E(X) + μC, Var(X) + σC²),    (22)

where the conditional mean and the conditional variance of X are provided by Theorems 8–10 of Y. Chen et al. (2025, in preparation). In the low-mean regime, there is therefore only the additional burden of evaluating the appropriate moments for X, and the asymptotic distribution of Z remains Normal for extensive data.
4.5. Distribution for Mi and Other Considerations
Provided that the relative level of systematic errors is small, fi ≪ 1 as assumed throughout, the choice of distribution for Mi is not critical. According to the approximation of Equation (15), E(Y) depends only on the first- and second-order moments for Mi, and it is independent of other moments of the distribution of Mi. Equation (8) ensures that E(Y) is independent of the choice for the distribution of Mi. Moreover, Var(Y) has only a mild dependence on higher-order moments, according to Equations (19) or (B.16). It is nonetheless useful to illustrate in detail how the choice of distribution affects the estimate of the variance of Y. For this purpose, we use a gamma-distributed Mi variable.
Let Mi be distributed according to a gamma distribution with rate parameter α and shape parameter r, indicated by Mi ~ gamma(α, r). If the variable is required to have an expectation and a variance according to Equation (8), then the two parameters of the distribution are, respectively, α = μ̂i/σ²int,i and r = μ̂i²/σ²int,i = 1/fi². The variance according to Equation (19) then becomes
where the term 6/r = 6 fi², which reflects the excess kurtosis of the gamma distribution, has a small effect, given that fi ≪ 1. Any other choice for the distribution of Mi can be treated in a similar way, and the resulting estimates for the variance of the Y distribution are in general provided by Equation (B.16).
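A gamma-distributed Mi with the moments required by Equation (8) can be generated as in the following sketch (our own code; with shape r = 1/fi² and scale fi² μ̂i, i.e., rate α = 1/(fi² μ̂i)):

```python
import numpy as np

rng = np.random.default_rng(3)
mu_hat, f = 100.0, 0.05        # best-fit mean and fractional systematic error
r = 1.0 / f**2                 # gamma shape parameter
scale = f**2 * mu_hat          # gamma scale = 1/alpha, rate alpha = 1/(f**2 * mu_hat)
M = rng.gamma(shape=r, scale=scale, size=200_000)
# mean = r*scale = mu_hat and sd = sqrt(r)*scale = f*mu_hat, as required by Eq. (8)
```

Unlike the Normal choice discussed below, the gamma variable is guaranteed to be positive, which is a desirable property for the mean of a Poisson distribution.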
Alternatively, a Normal distribution for Mi would reflect the assumption that there may be multiple sources for this variance that are likely to yield a Normal distribution, according to the central limit theorem. However, the fact that a Normal distribution allows negative values makes this choice reasonable only in the limit of fi ≪ 1, in which case the negative values of the Normal distribution, which are untenable for the mean of a Poisson distribution, can be safely ignored.
Finally, it may be useful to remark that the Normal approximation for the Y statistic is simply a result of the central limit theorem for extensive data. This conclusion applies regardless of the choice of distribution for Mi, provided that the distribution has finite mean and variance, as is the case for most distributions, including the Normal and gamma.
5. Monte Carlo Tests
The statistical model of systematic errors presented in Sections 3 and 4 is now tested with the aid of Monte Carlo numerical simulations that are representative of real data analysis cases, with the goal of illustrating the overall accuracy of this model to predict the distribution of Z, and its limitations. Given the independence of the statistic from the parameterization of the regression model f(x), as discussed in Section 4, it is sufficient to test the results with simple analytic models. For this paper, we choose a constant model with m = 1 parameter, and a linear model with m = 2 parameters. The Monte Carlo simulations are aimed at validating the distributions for the fit statistics X, Y, and Z with these simple models.
5.1. Linear Model with 10% and 5% Systematic Errors
For these simulations, a data set with N = 100 independent data points is drawn from a constant model with a parent mean μ = 100, in such a way that the Poisson data are in the extensive and large-mean regime. The data are fit to the same constant model, and also to a linear model y = a + b x, to obtain the maximum-likelihood estimates of the model parameters.
First, the X statistic is calculated according to Equation (2). Next, in order to simulate the presence of systematic errors, the best-fit models (both full and reduced) are randomized at each point according to a Normal distribution for Mi with parameters according to Equation (8), using a fixed value of fi = 0.1 representing a 10% level of systematic errors. The fit statistics with these randomized models are referred to as Z and Zr, with "r" denoting the fit to the reduced model. The contribution to the fit statistic due to the systematic error is then Y = Z − X. The same procedure was repeated 500 times to obtain the empirical cumulative distribution function (eCDF) of the statistics Z, X, and Y.
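This Monte Carlo procedure, simplified to the constant model and the Y statistic only, can be sketched as follows (our own code, not the authors' implementation; shown for the 5% systematic error case):

```python
import numpy as np

rng = np.random.default_rng(4)

def deviance(y, m):
    """Poisson deviance, with the y ln y term set to 0 when y = 0."""
    log_term = y * np.log(np.where(y > 0, y, 1.0) / m)
    return 2.0 * np.sum(log_term - (y - m))

N, mu, f, n_sim = 100, 100.0, 0.05, 2000
Y = np.empty(n_sim)
for k in range(n_sim):
    y = rng.poisson(mu, size=N).astype(float)
    mu_hat = np.full(N, y.mean())                # constant-model ML fit
    M = rng.normal(mu_hat, f * mu_hat)           # randomized best-fit model, Eq. (8)
    Y[k] = deviance(y, M) - deviance(y, mu_hat)  # Y = Z - X
```

The empirical mean of Y should land near the bias quoted in the text for f = 0.05, i.e., approximately 25% of the expected null-hypothesis statistic E(X) ≃ 100.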
Tests of the agreement between the empirical and predicted distributions of the X, Y, and Z statistics were performed with the Kolmogorov–Smirnov one-sample statistic (A. Kolmogorov 1933). The fit statistics X and Xr in the absence of systematic errors are expected to be distributed as χ2(N − m) according to Wilks's theorem. In this limit, this distribution is well approximated by a Normal distribution of the same mean and variance. In the same limit, the Y and Yr statistics are approximated by Normal distributions with mean and variance according to Equations (17) and (19), and the Z and Zr statistics also by Normal distributions according to Equation (21). For all of the statistics X, Y, and Z, the corresponding DN statistics for the comparison between the eCDF and the expected distribution have large p-values, indicating good agreement.
Additional Monte Carlo simulations similar to the one reported above were repeated with the level of systematic errors now set at 5%, or fi = 0.05. As for the previous case, we obtained good agreement between theoretical and simulated statistics, according to the respective DN statistics. The case of 5% systematic errors better represents a typical real-life data analysis situation where systematic errors may be invoked, rather than the case of 10% systematics. In fact, for N = 100 and μ = 100, a 5% systematic error leads to a bias that is at the level of 25% of the expected null-hypothesis statistic (with E(X) = N − m ≃ 100). This figure rises to ~ 100% for the 10% systematic errors, i.e., a bias of μC ≃ 100.
5.2. Linear Model with GLM Log-link
GLMs are commonly used in a variety of disciplines such as econometrics (e.g., C. Cameron & P. Trivedi 2013), and occasionally also in astronomy (e.g., R. de Souza et al. 2015; M. W. Hattab et al. 2019; S. C. Berek et al. 2023). In GLM models, the usual linear model with the two parameters θ = (a, b) is used with a linear predictor ηi = a + b xi and with Poisson mean μi = exp(ηi), which corresponds to the canonical log-link in the GLM notation, ηi = ln μi. This is a common choice for the analysis of Poisson data, to avoid possible nonpositive means with the usual linear model of Section 5.1. We therefore performed simulations with the same parameters as for the usual linear model that was used in the previous Section (with N = 100 and μ = 100), and obtained statistically indistinguishable results from those with the traditional linear model. Quantitative results with both the traditional linear model and the log-link model are presented in Section 5.3 below.
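A log-link Poisson fit of this kind can be sketched by minimizing the deviance directly (our own code; a dedicated GLM library would normally be used in practice):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)

def deviance(y, m):
    """Poisson deviance, with the y ln y term set to 0 when y = 0."""
    log_term = y * np.log(np.where(y > 0, y, 1.0) / m)
    return 2.0 * np.sum(log_term - (y - m))

# Simulate from a log-link GLM: eta_i = a + b*x_i, mu_i = exp(eta_i)
a_true, b_true = 3.0, 2.0
x = np.linspace(0.0, 1.0, 100)
y = rng.poisson(np.exp(a_true + b_true * x)).astype(float)

res = minimize(lambda t: deviance(y, np.exp(t[0] + t[1] * x)),
               x0=(3.0, 2.0), method="Nelder-Mead")
a_hat, b_hat = res.x
```

The exponential mean guarantees μi > 0 for any parameter values, which is the practical advantage of the log-link noted above.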
A GLM model with the canonical logarithmic link corresponds to a Pareto or power-law model with Poisson mean μi = C xi,r^b, with C = e^a, when the independent variable is on an exponential scale, xi,r = e^xi. It is necessary to highlight that, for any physically based model y = f(x), the usual linear model and the GLM linear model with a canonical log-link are not equivalent. Given that the linear model with log-link is equivalent to a power-law model with exponentially spaced independent variables xi,r, the GLM log-link linear model provides additional evidence of the independence of the results from the model parameterization.
5.3. Comprehensive Tests of the Y Statistic
Comprehensive Monte Carlo tests for the mean and variance of the distribution of Y were performed as follows. A selected number of representative values for the N, μ, and f parameters were chosen, and 500 Monte Carlo realizations of a constant model were used to generate random data sets that were then fitted to the same constant model, and also to a two-parameter linear model, the same as for the simulations presented in Section 5.1.
The mean and the variance of the resulting Y statistic for the fit to the linear model are shown in Figure 2, along with the predictions from Equations (17) and (19) that are reported as dashed lines. For all parameters, the Monte Carlo estimates of μC and σC agree typically with the model predictions to within ±10% for fi ≤ 0.1, and often within just a few percent. This degree of agreement indicates that the Y random variable is sufficiently accurately described by Equation (14), in a wide range of model parameters that are typical for data analysis applications. Identical results were found for the Yr statistic for the fit to the reduced model.
Figure 2. Results of the Monte Carlo simulations of the Y variable for a data set with a constant parent model. The parent values of the Poisson mean are coded according to color, and the symbols (also color-coded) are used to distinguish the values f = 0.01, 0.02, 0.05, 0.10, and 0.20 for the fractional systematic error. Dashed lines are the theoretical curves with the approximations of Equations (17) and (19). The f = 0.20 curves with color-coded right-triangle symbols are used to illustrate the departure from the approximations when the systematic errors are too large, and are not used in applications. Data sets with N = 10, 50, 100, 500, and 1000 independent data points were simulated.
The range of parameters for the simulation was limited by design to fi ≤ 0.1, plus the case of fi = 0.2 that is used to illustrate the errors in the estimates for larger values of systematic errors. Table 1 reports the average fractional deviations ημ and ησ, and their standard deviations, according to the results shown in Figure 2. A 20% systematic error is outside the domain of the approximations presented in this paper, and therefore the deviations seen in Figure 2 and Table 1 for fi = 0.2 are simply a result of the inaccuracy of the Taylor series approximation; see Appendix Equation (B.1). In fact, large values of this parameter are not of practical use in data analysis applications, as already remarked in Section 5.1. Also, the low-mean regime μ ≤ 10 is not tested by these simulations.
Table 1. Fractional Differences ημ and ησ between the Monte Carlo Simulations of μC and σC (Figure 2) and Their Predictions According to Equations (17) and (19)
f | ημ | ησ |
---|---|---|
Linear model | ||
0.01 | −0.012 ± 0.055 | −0.055 ± 0.055 |
0.02 | −0.004 ± 0.058 | −0.030 ± 0.046 |
0.05 | 0.004 ± 0.023 | −0.003 ± 0.037 |
0.10 | 0.027 ± 0.012 | 0.002 ± 0.053 |
0.20 | 0.081 ± 0.029 | 0.184 ± 0.095 |
GLM with log-link | ||
0.01 | 0.022 ± 0.160 | −0.013 ± 0.047 |
0.02 | 0.021 ± 0.083 | −0.024 ± 0.057 |
0.05 | 0.010 ± 0.041 | 0.000 ± 0.044 |
0.10 | 0.010 ± 0.030 | 0.002 ± 0.049 |
0.20 | 0.068 ± 0.015 | 0.148 ± 0.082 |
In addition, another set of Monte Carlo simulations similar to those illustrated in Figure 2 was performed using the linear GLM with canonical link introduced in Section 5.2. In this case, we used a logarithmically spaced set of independent variables xi, with the upper bound chosen to avoid possible overflow issues when implementing the exponential mean. The results of those simulations are statistically indistinguishable from those with the usual linear model; see Table 1.
The agreement between the Monte Carlo simulations for systematic errors with fi ≤ 0.1 and the predictions according to the theory presented in Sections 3.3 and 4 validates this model as an accurate tool to estimate the goodness of fit for the regression of Poisson data in the presence of systematic errors.
6. Methods of Hypothesis Testing
In this Section we provide a summary of the use of this model for two classes of applications. In the first case, a known value for the intrinsic model variance σint can be used for the usual hypothesis testing of a measured value of the statistic. In the second case, when it is not possible or convenient to estimate the intrinsic variance a priori, the distribution of Z presented in this paper can be used to estimate the intrinsic variance itself, under the assumption that the null hypothesis is correct. We then provide an application to real-life astronomical data.
6.1. Hypothesis Testing with Known Systematic Errors
In many data analysis situations, it is possible to estimate a priori the level of systematic errors present in the measurements. This is the regime that was envisioned for the model of systematic errors developed in Section 3, where the σint,i parameters are assumed to be known. For example, inherent uncertainties associated with either or both the photon-counting device and the parent astrophysical process that generates the photons are common for the detection of high-energy photons from distant quasars, such as in the recent applications by D. Spence et al. (2023), F. Nicastro et al. (2018), and O. E. Kovács et al. (2019), among many others. In that class of applications, it is common to use a simple power law or other phenomenological models, such as a spline, to model the number of photons detected as a function of wavelength. The parent model is likely to be more complex and often unknown, giving rise to the need for a source of systematic error in the Poisson regression, as opposed to the employment of an alternative model.
For example, in D. Spence et al. (2023), we estimated a level of systematic errors in the calibration of the so-called effective area of the CCD detector at the few percent level for the main instrument used for those data. Given that the effective area of the instrument is linearly related to the measured counts, this uncertainty could be used as an a priori value of the fi parameter, say fi ≃ 0.02, to test whether the resulting fit statistic is consistent with that value of the intrinsic model variance.
In this class of applications, assuming extensive data, hypothesis testing consists of two steps:
- 1. First, the bias parameter μC and the overdispersion parameter σC² are calculated according to Equations (17) and (19), using the a priori values of the fi parameters.
- 2. Then, the relevant parent distribution for Z, as discussed in Section 4.4, is used to perform the usual one-sided null-hypothesis test with the measured value of the statistic, for a given p-value.
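The two steps above can be sketched as follows (our own code; we assume the leading-order expressions μC ≈ f²·Σμ̂i and σC² ≈ 4f²·Σμ̂i for the bias and overdispersion parameters, which are consistent with the numerical examples of Section 6.2):

```python
import math

def systematics_pvalue(z, nu, S, f):
    """One-sided p-value of a measured statistic z, using the Normal
    approximation Z ~ N(nu + mu_C, 2*nu + sigma_C2), with the assumed
    leading-order mu_C = f**2 * S and sigma_C2 = 4 * f**2 * S,
    where S is the sum of the best-fit means."""
    mean = nu + f**2 * S
    sd = math.sqrt(2.0 * nu + 4.0 * f**2 * S)
    return 0.5 * math.erfc((z - mean) / (sd * math.sqrt(2.0)))

# Numbers of the Section 6.2 example: N = 100 points, means ~100, measured z = 125
nu, S, z = 98, 10_000.0, 125.0
f_hat = math.sqrt((z - nu) / S)           # point estimate of f, cf. Equation (25)
p = systematics_pvalue(z, nu, S, f_hat)   # 0.5 by construction at the point estimate
```

Testing a different assumed value of f in place of f_hat yields the corresponding null-hypothesis probability, as in the case study of Section 6.3.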
6.2. Estimate of the Intrinsic Model Variance from Data
In certain experimental cases, it is not possible or interesting to provide an accurate estimate of the level of systematic errors present in the problem a priori. This situation arises when the data regression yields an unacceptably high value of the fit statistic, and the data analyst (a) may not have an alternative model for the regression, or (b) is interested in evaluating the level of systematic errors relative to such a model, following the belief that the data are in fact accurately described by the model at hand.
In this class of applications, the parent distribution of Z according to Equation (21), or Equation (22) in the low-mean regime, can be used to estimate the intrinsic model variance σ²int, i.e., the level of systematic errors, assuming that the regression model is the parent model. In this Section, the measured goodness-of-fit statistic will be referred to as the lower-case z, to distinguish it from the statistic Z itself. We also assume that the intrinsic level of systematic errors fi is constant among the data points, and we indicate this value as f.
A point estimate for the average level of systematic errors f can be obtained by requiring z = E(Z), i.e., that the measurement equal the expectation of the parent distribution. In general, the values of the mean and variance of X were discussed in Section 4.4, and in the large-mean regime, they are E(X) = (N − m) and Var(X) = 2(N − m). It follows that the bias parameter can be estimated as

μ̂C = z − (N − m).    (24)
Then, using Equation (17) and assuming a constant value for f, the estimate is simply obtained as

f̂ = √(μ̂C / Σi μ̂i) = √[(z − (N − m)) / Σi μ̂i].    (25)
For example, a value of z = 125 for a data set with N = 100 and a parent mean μ = 100 leads to an estimate of f ≃ 0.05 according to Equation (25), similar to the one assumed for the Monte Carlo simulations with fi = 0.05.
Confidence intervals for f can be obtained by requiring that the measurement corresponds to specified quantiles of the parent distribution for Z. For example, for a 68% confidence interval, the condition

z = (N − m) + μC(f) ± √(2(N − m) + σC²(f))

can be used to obtain the lower and upper limits of the confidence interval for f, upon which the bias and overdispersion parameters in Equation (26) depend. The equation is quadratic in f, and it can be further simplified if the overdispersion parameter is held fixed at the point estimate of Equation (19). With this approximation, confidence intervals on the f parameter are estimated via

f = √[(z − (N − m) ± α √(2(N − m) + σ̂C²)) / Σi μ̂i],    (27)

i.e., the lower and upper bounds of the confidence interval are obtained from Equation (27), with α = 1 for a 68% confidence level. Confidence intervals at a different level of probability p (i.e., p = 0.9 or 0.99) can be obtained by multiplying the last term in the numerator of Equation (27) by the proper factor α corresponding to the (1 + p)/2 quantile of an N(0, 1) distribution (e.g., α = 1.65 and 2.58 for p = 0.9 and 0.99, respectively). Continuing with the same example as above, the 68% confidence interval according to Equation (27) is given by f = (0.031, 0.067), which brackets the point estimate f ≃ 0.05. It is clear that Equation (27) may yield a negative number under the square root for the lower limit; that is an indication that no systematic errors are required in that case (i.e., a lower confidence bound of zero).
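The confidence interval of this example can be reproduced with the following sketch (our own code; the overdispersion is held at the assumed leading-order value 4f²·Σμ̂i evaluated at the point estimate):

```python
import math

def f_confidence_interval(z, nu, S, alpha=1.0):
    """Confidence bounds on a constant systematic-error level f (alpha = 1
    for a 68% interval), with S the sum of the best-fit means and the
    overdispersion held fixed at the point estimate."""
    f2 = (z - nu) / S                           # point estimate of f**2
    sd = math.sqrt(2.0 * nu + 4.0 * f2 * S)     # sd of Z at the point estimate
    lo2 = (z - nu - alpha * sd) / S
    hi2 = (z - nu + alpha * sd) / S
    f_lo = math.sqrt(lo2) if lo2 > 0 else 0.0   # negative root -> no systematics needed
    return f_lo, math.sqrt(hi2)

f_lo, f_hi = f_confidence_interval(z=125.0, nu=98, S=10_000.0)
```

For the example quoted above (z = 125, N = 100, μ = 100), this sketch reproduces the 68% interval f = (0.031, 0.067).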
In principle, it may be possible to estimate the unknown σint,i parameters following a direct maximum-likelihood method, which depends on the choice of the compounding distribution for the random variable Mi (see Section 4.5). For example, a compounding gamma distribution results in a negative binomial distribution for the data, as discussed in Section 3.3, and one could proceed with a maximum-likelihood method to estimate the overdispersion parameter of the gamma distribution (e.g., following the methods discussed in R. A. Fisher 1941; W. W. Piegorsch 1990; J.-P. Wang 2010). Other methods are also available to estimate the compounding distribution of a Poisson variable (e.g., H. G. Tucker 1963; L. Simar 1976). The approximate method to estimate the overdispersion parameter presented in this Section, on the other hand, has the advantage of being distribution-independent, provided the conditions in Equation (8) on the choice of Mi are met. This method can therefore be used immediately for virtually any application, without the additional complications associated with the use of a direct maximum-likelihood method for this task.
6.3. A Case Study with Astrophysical Data
The methods discussed in this Section are further illustrated with the same data presented in D. Spence et al. (2023) and M. Bonamente (2023a) for the spectra of the quasar 1ES 1553+113. The spectrum used in this case study corresponds to the data illustrated in Figure 4 and Table 2 of M. Bonamente (2023a), of which a narrow range is illustrated in Figure 3.
Figure 3. Zoom-in on a narrow wavelength range of the XMM–Newton spectra of the source 1ES 1553+113 analyzed by D. Spence et al. (2023). The RGS1 data were shifted upward by 20% for clarity. Vertical error bars are the square root of the number of counts, for illustrative purposes only.
The data are representative of a maximum-likelihood regression with Poisson data, in the extensive and large-count regime. Several data points have significantly lower detected counts (i.e., below 25 Å for RGS2), due to a decrease in the instrument's sensitivity, or gain. Such detector inefficiencies are calibrated and corrected in the mean (e.g., M. Tsujimoto et al. 2011), but residual uncertainties are presumed to contribute to the larger-than-expected variance, following the model of systematics presented in Section 3.1. Figure 3 illustrates how the complex parametric model generally follows the data well, without systematic deviations that could be interpreted as a failure of the model's mean to describe the data. Small-scale fluctuations of the model (of order ≤1%) are due to changes in the sensitivity of the instrument as a function of wavelength, while large-scale trends are a combination of wavelength-dependent sensitivity of the instrument and the genuine spectral distribution of the source.
The data have the following parameters: N = 1526 independent data points and a model with m = 48 free parameters, for a number of degrees of freedom ν = N − m = 1478, and a fit statistic z = 1862.7. The number of detected counts per bin varied across the wavelength range, with ranges 114–1337 for RGS1 and 133–1226 for RGS2, and with an average of 728.6 counts per bin for RGS1 and 756.2 for RGS2; therefore, these data are in the large-mean regime. The one-sided p-value of the measured statistic according to the parent distribution χ2(ν) is negligibly small, indicating that the null hypothesis of the Poisson data being drawn from the parent model is formally unacceptable. Following the present model for systematic errors in the Poisson regression to address the larger-than-expected fit statistic, Equations (25) and (27) yield an estimate of f ≃ 0.018, meaning that the larger-than-expected statistic is consistent with a systematic error of approximately 2%, which is in fact consistent with known levels of detector systematics (D. Spence et al. 2023).
An alternative analysis of these data consists of the assumption of a given value for the level of systematic errors. If one assumes the same point estimate f = 0.018 based on Equation (25), it is immediate to see that the null-hypothesis probability is 50%: this is by design, according to Equation (24). A more interesting example is to test the hypothesis of a somewhat smaller value, say f = 0.01, or 1% intrinsic variance. Then, Equations (17) and (19) yield a bias parameter μC and an overdispersion parameter σC², and the parent overdispersed χ2 distribution in Equation (21) is approximated with a Normal distribution
The measured value z = 1862.7 corresponds to a null-hypothesis probability value of p = 0.000002, indicating that this is indeed too small a value of the intrinsic variance, in accord with the estimate above of f ≃ 0.018. The same procedure can be used to test any other level of systematic errors.
7. Discussion and Conclusions
This paper has presented a simple statistical method to address the presence of systematic errors in the regression of Poisson data, even when there is no information on the physical source of such uncertainties. The method is based upon the reversal of the role played by systematic uncertainties, which are usually attributed to an overdispersion of the data-generating process. Within that context, overdispersed integer-count data may be modeled with a variety of alternative distributions, such as the negative binomial (see, e.g., J. M. Hilbe 2011, 2014). That approach, however, comes at the price of a more complex method of regression and its associated goodness-of-fit measure. Instead, retaining the relatively simpler Poisson distribution for the data and attributing the overdispersion to an intrinsic variance in the model makes it possible to retain the unbiasedness and relative simplicity of the quasi-maximum-likelihood Poisson regression (see C. Gourieroux et al. 1984b, 1984a; as discussed also in Section 3.3), and to obtain a simple generalization of the goodness-of-fit statistic.
The main result of this paper is that the goodness-of-fit statistic for the regression of Poisson data in the presence of systematic errors is the sum of two independent statistics. The first term is the usual Poisson deviance or Cash statistic X = DP, with an asymptotic Normal distribution in the extensive data regime (Y. Chen et al. 2025, in preparation). The second is a corrective term Y that is also asymptotically Normal, with simple relations given in Equations (17) and (19) for its mean and variance. As a result, hypothesis testing with this model for systematic errors is simple to implement in a variety of data analysis situations, as illustrated with the numerical simulations of Section 5 and in the case study of Section 6.3.
This new method was motivated by the need for an accurate method to include systematics in the regression of integer-count data, which are ubiquitous across the sciences. In addition to hypothesis testing, this method also permits an estimate of the intrinsic variance from the data, when no information is available on the level of systematic errors. The case study based on the astrophysical spectral data (D. Spence et al. 2023) has illustrated these methods with real-life measurements. The numerical simulations that we have performed for this paper show that the method is accurate in the Poisson large-mean limit, where we have used the chi-squared approximation for the distribution of the Poisson deviance DP statistic.
Appendix A: Reference Distributions, Compounding, and Other Formulas
A.1. The Poisson and Gamma Random Variables
The probability mass function of a Poisson random variable X ~ Poisson(μ) is

P(X = n) = μ^n e^(−μ) / n!,  n = 0, 1, 2, …,

where μ > 0 is usually referred to as the rate of the Poisson distribution. The mean and variance of X are E(X) = Var(X) = μ.
The probability distribution function (pdf) of a gamma random variable X ~ gamma(α, r) is

f(x) = α^r x^(r−1) e^(−αx) / Γ(r),  x > 0,

where Γ is the usual Gamma function and α, r are positive real numbers. The parameter α is the rate parameter (with 1/α the scale parameter), and r is the shape parameter. The mean and variance of a gamma(α, r) random variable X are E(X) = r/α and Var(X) = r/α²,
with a skewness and kurtosis of, respectively, skew(X) = 2/√r and kurt(X) = 3 + 6/r. Other properties of the Poisson and gamma distributions can be found in most textbooks on probability (e.g., R. Hogg et al. 2023; K. Siegrist 1997).
A.2. Compounding or Mixing of Distributions
Many experiments are conveniently described by random variables whose parameters are themselves variables following another distribution. This compounding of random variables is well documented in the literature, such as the Poisson distribution with a gamma-distributed rate parameter, i.e., X ~ Poisson(μ) with μ ~ gamma(α, r),
which results in a negative binomial distribution (e.g., M. Greenwood & G. Y. Yule 1920; C. I. Bliss & R. A. Fisher 1953).
Compound distributions are sometimes referred to as contagious distributions (e.g., J. Gurland 1958), based on the fact that certain models of diseases feature a (positive) correlation between the number and the probability of occurrence of certain outcomes (see, e.g., M. Greenwood & G. Y. Yule 1920). More recently, these types of distributions have been referred to as mixtures, and they are commonly used in a variety of machine learning applications, such as in the EM algorithm (T. Hastie et al. 2009; L. E. Baum et al. 1970).
General formulas for the mean and variance of a compound distribution can be given, under certain circumstances. Let X have a distribution F (e.g., Poisson), and let a parameter θ of its distribution (for the Poisson, there is only the μ parameter) be distributed according to G; then the resulting compounded distribution H is such that
E_H(X) = E_G[E_F(X | θ)] and Var_H(X) = E_G[Var_F(X | θ)] + Var_G[E_F(X | θ)],
where the subscript highlights the relevant distribution to be used (e.g., A. M. Mood et al. 1974). In the example of the Poisson distribution with a gamma-distributed rate parameter μ, which is also the mean of the Poisson distribution, this implies
E(X) = E(μ) = r/α and Var(X) = E(μ) + Var(μ) = r/α + r/α^2.
The results show that the Poisson distribution, compounded with a gamma-distributed rate parameter, leads to overdispersion relative to the pure Poisson case.
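A minimal Monte Carlo sketch of this overdispersion, with illustrative gamma parameters (α = 0.5 and r = 10 are assumptions, not values from the text): each count is drawn from a Poisson whose rate was itself drawn from a gamma, and the sample variance exceeds the sample mean by Var(μ), as the formulas above predict.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, r = 0.5, 10.0                       # gamma rate and shape (illustrative)

# compound: draw a gamma rate for each trial, then a Poisson count at that rate
mu = rng.gamma(shape=r, scale=1.0 / alpha, size=1_000_000)
y = rng.poisson(mu)

print(y.mean())   # ~ E(mu) = r/alpha = 20
print(y.var())    # ~ E(mu) + Var(mu) = 20 + 40 = 60, i.e., overdispersed
```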
A.3. Other Formulas of Interest for DP
In general, there is no guarantee that the sum of the deviations in a Poisson regression is equal to zero. When an intercept is used in a GLM with the canonical logarithmic link, i.e., log μi = θ0 + ∑j θj xij, then it is immediate to see that the sum of the ML Poisson deviations does in fact equal zero. In that case, the deviance statistic simplifies to
DP = 2 ∑ yi log(yi/μi),
and it is defined as such in certain applications (e.g., L. A. Goodman 1969). However, given the emphasis on nonlinear models that may or may not have an intercept, it is convenient to retain the usual expression of Equation (2) that applies in general.
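For the constant model, where the ML estimate is simply the sample mean, both properties can be verified in a few lines. The simulated counts below are illustrative, and the full deviance follows the standard Poisson form with the 2∑(yi − μi) term included.

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.poisson(50.0, size=200)          # simulated counts, constant model

mu_hat = y.mean()                        # ML estimate for the constant model

# with an intercept, the ML Poisson deviations sum to zero ...
print(np.sum(y - mu_hat))                # ~ 0, up to floating-point rounding

# ... so the deviance reduces to the single logarithmic sum
dev_full = 2 * np.sum(y * np.log(y / mu_hat) - (y - mu_hat))
dev_simple = 2 * np.sum(y * np.log(y / mu_hat))
print(dev_full, dev_simple)              # agree to floating-point precision
```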
Appendix B: Estimates of Moments of the Y Statistics
This Appendix describes the evaluation of the moments of the Y statistic, which was defined in Equation (13).
B.1. Definition
The Y statistic is defined by
where Mi is a random variable with mean and variance ; see Equation (8). In Equation (B1), is the ML estimate of the mean of the N independent Poisson data points yi. The goal is to estimate the mean E(Y) and the variance Var(Y) as a function of the fixed parameters and σint, and as a function of the data yi or its parent means μi.
The Y random variable is approximated via the usual Maclaurin series for the logarithmic term,
With a zero-mean variable of variance , and also by design, the expectation of the higher-order terms is much smaller than the second-order term. Therefore, ignoring higher-order terms provides an accurate approximation, and hereafter the Y statistic is approximated by its first three terms as written in Equation (B2).
B.2. Approximations
It is well known that, under mild conditions on the properties of the parametric model, the ML parameters are asymptotically Normally distributed,
θ̂ ∼ N(θ, A^−1),
where A is the Fisher information matrix, and its inverse the usual covariance or error matrix. A common method to estimate the parameter variances is via the Hessian estimator , which evaluates the information matrix at without taking the expectations. An equivalent -dependence applies to any function of the parameters, in particular to the means μi(θ), and the approximate delta method, usually referred to as the error propagation method in the physical sciences, leads to
Var(μ̂i) ≃ (∂μi/∂θ)T A^−1 (∂μi/∂θ),
where θ is a column vector, and T denotes its transpose (for textbook references, see, e.g., P. R. Bevington & D. K. Robinson 2003; L. Wasserman 2010). For example, for the constant model considered in the Monte Carlo simulations of Section 5, the partial derivatives in Equation (B4) are trivially equal to the unity scalar, while for the usual linear model y = a + b x with θ = (a, b), it is ∂μi/∂θ = (1, xi), where xi is the fixed coordinate for the yi measurement.
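As a sketch of this error-propagation step for the linear model y = a + b x, the quadratic form can be evaluated directly; the parameter covariance matrix below is an arbitrary illustrative stand-in for the inverse Fisher matrix, not a value derived in the text.

```python
import numpy as np

# delta-method (error-propagation) variance for mu_i = a + b * x_i,
# given an assumed parameter covariance matrix Sigma ~ A^-1
Sigma = np.array([[0.04, -0.01],
                  [-0.01, 0.01]])        # cov of (a_hat, b_hat), illustrative

def var_mu(x_i):
    g = np.array([1.0, x_i])             # gradient d mu_i / d theta = (1, x_i)
    return g @ Sigma @ g                 # g^T Sigma g

print(var_mu(0.0))   # 0.04: at x = 0 only Var(a_hat) contributes
print(var_mu(2.0))   # 0.04 - 2*2*0.01 + 4*0.01 = 0.04
```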
For extensive data, the standard deviation of the ML means in Equation (B4) therefore decreases as , and accordingly, we approximate as fixed values for the purpose of estimating the moments of Y. It is worth pointing out that the variability of is duly accounted for in the distribution of the X statistic in Equation (12), which represents the usual Poisson deviance DP without systematic errors. This assumption was tested with the Monte Carlo simulations presented in Section 5.3.
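This shrinking spread of the ML means can be illustrated for the constant model, where the ML estimate is the sample mean and its standard deviation is √(μ/N). The rate μ = 30 and the sample sizes below are illustrative choices, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(3)
mu = 30.0                                # illustrative Poisson rate

# spread of the ML estimate (the sample mean) as the sample size N grows
stds = {}
for N in (100, 400, 1600):
    mu_hats = rng.poisson(mu, size=(4000, N)).mean(axis=1)
    stds[N] = mu_hats.std()
    print(N, stds[N])                    # ~ sqrt(mu/N): halves as N quadruples
```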
B.3. Evaluation of the Moments
The assumption of independence between Mi and yi, by construction, and the assumption that have negligible variance in the extensive data regime (see above) simplify considerably the evaluation of the moments of Y.
The approximate estimates for the mean of Y are therefore given by
which requires only the given first- and second-order moments of the Mi variable for terms (A) and (B), respectively. It therefore immediately follows that
which is reported as Equation (17). Notice that the expectation of (A) is null regardless of the approximation of a constant , since . The second approximate equality is exact if the model y = f(x) has an intercept, as do both the constant and the linear models (and if is constant for all data points), since the ML Poisson regression in that case satisfies ∑(μi − yi) = 0. This approximation applies to any distribution for Mi, provided the mean and variance are set according to Equation (8).
For the variance, we start by evaluating the second-order moment
where the subscript "P" indicates the Poisson distribution for the data yi, and the subscript "int" represents the distribution for the Mi variable, which is assumed independent from the data. The cross-product terms of (A) and those between (A) and (B) vanish because of independence among the N terms and the property .
The cross-product terms of (B) for i ≠ j, indicated by C, survive as the indicated double sum, which by independence evaluates to
using Equation (B6) above, with the negative correction term arising from the j = i terms that were missing in the second sum, and indicating as usual .
It therefore follows that the general expression for the approximation to the variance of Y according to the Mi distribution is
which depends on the kurtosis of the distribution for Mi. For Normal , kurt(Mi) = 3. For a gamma distribution of the type considered in Section 4.5,
where the excess kurtosis relative to the Normal distribution provides a small correction, since fi ≪ 1.
A simplified expression for Var(Y) could have been obtained from Equation (B8) with the assumption that yi are constant, which would operationalize the variance of Y as
It is therefore reasonable to use either Equation (B8) with the replacement of μi with yi, or Equation (B9), as numerical estimators of the variance of Y. In the large-mean regime that is assumed throughout, differences between the estimators are expected to be negligible.
Footnotes
- 4
See also C. Cameron & P. Trivedi (2013), where the factor of 2 was omitted.
- 8
This is true provided each term of Y has finite mean and variance according to the choice of distribution in Equation (8), which is usually the case for most distributions.
- 9
This can be immediately seen from the property of infinite divisibility of the Normal distribution, or that the convolution of two Normal distributions remains Normal.
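The convolution property in this footnote can be spot-checked numerically: the sum of two independent Normal samples has mean and variance equal to the sums of the components' means and variances, consistent with the sum remaining Normal. The means and standard deviations below are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(4)
# sum of two independent Normals: means add, variances add
a = rng.normal(1.0, 2.0, size=1_000_000)
b = rng.normal(-3.0, 1.5, size=1_000_000)
s = a + b
print(s.mean())   # ~ 1.0 + (-3.0) = -2.0
print(s.var())    # ~ 2.0**2 + 1.5**2 = 6.25
```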