Estimation of heavy tails in optical non-linear processes

In optical non-linear processes rogue waves can be observed, which can be mathematically described by heavy-tailed distributions. These distributions are special in that the probability of registering extremely high intensities is significantly higher than for the exponential distribution, which is most commonly observed in statistical and quantum optics. The current manuscript gives a practical overview of the generic statistics toolkit concerning heavy-tailed distributions and proposes methods to deal with issues specific to non-linear optics. We take a closer look at supercontinuum generation, where rogue waves have already been observed. We propose modifications to the Hill estimator to deal with detector saturation as well as with corrections introduced by pumping the process with bright squeezed vacuum. The suggested methodology facilitates statistically reliable observation of heavy-tailed distributions in non-linear optics, nano-optics, atomic and solid-state processes, and optomechanics.


Introduction
Heavy-tailed distributions, and power-law (or Pareto) distributions in particular, have been reported from a very broad range of areas, including earthquake intensities [1][2][3], avalanche sizes [4], solar flares [5], degree distributions of various social and biological networks [6][7][8], incomes [9,10], insurance claims [11,12], numbers of citations of scientific publications [13][14][15], and many more. For financial institutions, the importance of heavy-tailed behavior comes from the fact that a simple Gaussian model severely underestimates the risks associated with different products or investment strategies, which in turn results in considerable losses. This is why the mathematical background of heavy-tailed distributions and their estimation have been most extensively studied in this context ( [16][17][18][19] and others).
For physicists, in turn, the general fascination with power laws comes from the concept of universality classes in statistical physics, which describe the behavior of various systems close to their critical points, irrespective of the details of the mechanisms of the systems [20]. Inspired by this, and using the increasing availability of good-quality data sets, many in the field have ventured into interdisciplinary waters where power-law behavior has been reported, thereby creating entirely new disciplines like econophysics [21,22]. In such investigations, the emphasis has been mostly on understanding the emergence of power-law behavior [23][24][25]. In contrast, the problem of practical estimation methods for the exponent associated with the power law (or indeed the verification of power-law behavior) has received relatively little attention from the community. Notable exceptions are [8,26], in which the power-law nature of numerous phenomena has been questioned.
Turning to non-linear optics [27,28], the problem formulation is, however, somewhat different. Optical experiments producing light with heavy-tailed intensity distributions allow high repetition rates and large samples to study unstable non-linear phenomena and their sources [29,30]. Moreover, unlike in social or financial contexts, experiments can be repeated. For intense pumps, non-linear systems can produce bright signals with directly measurable intensity fluctuations. In statistical optics [31], coherence theory and quantum optics [32], the probability density function of intensities is a subject of investigation for intense light beams. Remarkably, it allows a direct observation of macroscopic quantum phenomena [33][34][35][36][37]. The heavy-tailed nature of intensity distributions in supercontinuum generation setups [38][39][40] (hence the term rogue waves) does have a theoretical basis. So our aim is not to decide whether the observed distribution is heavy-tailed or not. We rather propose ways to estimate the tail exponent of the distribution in the presence of experimental imperfections, which can help in experimental control and design. The issue of detector saturation, for example, deserves special attention, since for intensity distributions with extremely heavy tails it cannot be solved by a simple re-calibration. There is always going to be some portion of the data that will be affected by saturation. Low-intensity statistics, on the other hand, are mostly determined by background noise. Consequently, there is only a limited interval of intensities within which the observations can be used for estimation purposes, and this hugely affects the efficiency of any estimation procedure. If we are unlucky, the intervals affected by the background noise and detector saturation overlap, and the experiment cannot be salvaged.
If, however, there is a portion that is useful, we propose ways to gain an initial estimate of the tail exponent (and by extension, higher quantiles), which helps design further measurement setups. The procedures presented in the current manuscript can be used to quantify the instabilities resulting in extreme events common in optical non-linear processes [41,42], and potentially also in non-linear optomechanics [43,44], four-wave mixing in atomic vapors [45,46] and wave-mixing processes in superconducting circuits [47][48][49]. Beyond fundamental interest, extreme events can be used to produce highly non-classical effects distillable to large squeezing and entanglement for quantum technology [50,51]. In parallel, they also inspire an investigation of the optical Maxwell demon principle [52] for heavy-tailed distributions.
The article is structured as follows: In Section 2, we give a brief overview of the mathematical background and clarify terminology. Section 3 describes a generic estimation toolkit with pointers to more sophisticated methods. Section 4 proposes variations of the generic methods for evaluating numerical data in non-linear optics specifically.

Background and terminology
Let us first provide an overview of the terminology and concepts concerning heavy-tailed distributions, used throughout this work.
In general, distributions that decay slower than exponential are referred to as heavy-tailed. To give a clear mathematical formulation of this concept, it is useful to introduce the tail function (also referred to as survival function, or complementary distribution function), defined for an arbitrary real-valued random variable X as

F̄(x) = P(X ≥ x).

In other words, this function describes the probability that the variable reaches or exceeds a threshold x. It is related to the more familiar distribution function (DF) F(x) through F̄(x) = 1 − F(x), and to the probability density function (PDF) f(x) through F̄(x) = ∫_x^∞ f(t) dt. In what follows, we will only consider the right tail of distributions, that is, the behavior of the largest values, and for the sake of simplicity, suppose that we are dealing with positive-valued random variables like optical intensities. The whole treatment can be straightforwardly extended to the left tails of distributions. Using the tail function (TF), a heavy-tailed distribution can be defined as a distribution for which

C ≡ lim_{x→∞} [−ln F̄(x) / x] = 0.    (1)

There are other equivalent definitions [53]; we prefer the latter formulation since it is the one which is most in line with the somewhat vague notion of an "L-shaped" distribution used in connection with rogue waves ( [54], for example). Note that if the limit C in (1) is a finite positive value, the distribution decays asymptotically at an exponential rate, while if it is infinite, the distribution decays faster than exponential. Definition (1) even includes distributions whose moments of any order are finite, for example the log-normal distribution and the Weibull distribution with a shape parameter lower than one [55]. Pareto-type distributions (or regularly-varying distributions) form a subset of heavy-tailed distributions and are defined as

F̄(x) = L(x) · x^{−α},    (2)

with L(x) being a slowly-varying function (for any t > 0, lim_{x→∞} L(tx)/L(x) = 1; slowly-varying functions include for example the logarithm function and functions that have a finite limit at infinity) and α > 0.
For the Pareto distribution, L(x) = const, corresponding to exact power-law behavior. The exponent α is referred to as the tail exponent. Note that this exponent describes the decay of the tail function; the exponent corresponding to the decay of the PDF is α + 1. The moments E(X^a) are finite for all a < α and infinite for all a > α. Whether E(X^α) itself is finite depends on the function L(x).
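The moment condition can be illustrated numerically (a minimal sketch in Python; the exponent value and sample size are arbitrary choices): for a standard Pareto sample with α = 3, the sample mean of X^a stabilizes for a = 1 < α, while for a = 4 > α the sum is dominated by the single largest observation.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, n = 3.0, 100_000
# Inverse-transform sampling: if U ~ Uniform(0,1), then U**(-1/alpha) is
# standard Pareto with tail function x**(-alpha) for x >= 1
x = rng.uniform(size=n) ** (-1.0 / alpha)

m1 = (x ** 1).mean()        # a = 1 < alpha: converges to alpha/(alpha-1) = 1.5
m4 = (x ** 4).mean()        # a = 4 > alpha: E[X^4] is infinite, never converges
frac1 = x.max() / x.sum()                  # share of the largest term, a = 1
frac4 = (x ** 4).max() / (x ** 4).sum()    # share of the largest term, a = 4
```

For a = 4, a substantial fraction of the sum comes from the single largest observation, which is the practical signature of a non-existent moment.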
[Figure 1: Categorization of distributions according to tail heaviness and maximum domain of attraction, with examples: exponential [41], gamma [41], uniform [62], beta [63], log-normal [61], Pareto [9], Cauchy [64], Lévy [65], log-gamma [66], and log-Pareto [60].]
What makes heavy-tailed distributions statistically special is the fact that many traditional procedures based on the mean and the variance are inapplicable if the first and second moments do not exist. It is possible, of course, to compute sample averages, but they are meaningless if the corresponding expected value is not finite for the underlying distribution. That is, as the number of observations is increased, such sample averages do not converge to a finite number. The lack of definite moments limits many evaluations in statistical optics [31], coherence optics and quantum optics [32]. One should not, for example, calculate second-order quantities like correlations [42,[56][57][58]] for α < 2. Furthermore, the traditional central limit theorem, which lies at the heart of many applied models, is not applicable either.
The question of what can be said about the largest observations for such distributions arises quite naturally. There are limit laws concerning the maxima of samples and the distribution of threshold exceedances. Very simply stated, if a distribution has the form (2), the distribution of sample maxima tends to an extreme value distribution described by the DF exp[−(1 + γx)^{−1/γ}], with 0 < γ = α^{−1} < ∞. The distribution of the exceedances of a sufficiently large threshold l (that is, the random variable X − l | X > l) tends in turn to a generalized Pareto distribution with DF 1 − (1 + γx)^{−1/γ}, γ = α^{−1}. Both limit laws hold, of course, given proper normalizing constants; for details see [59]. These limit laws can be straightforwardly used to model the behavior of the underlying distribution beyond the largest observed value.
There are heavy-tailed distributions whose tails are heavier than (2) (for example the log-Pareto, constructed by exponentiating a Pareto-distributed variable); these limit laws do not apply to them. There are also heavy-tailed distributions whose tails are lighter than (2); for these, the DF of the maxima tends to exp(−e^{−x}), whereas threshold exceedances tend to an exponential distribution. These distributions arise as the natural limit for γ = 0. Distributions with a light, but infinite tail also belong to the γ = 0 domain of attraction. The limit distribution of the maxima of random variables with a finite upper bound is also exp[−(1 + γx)^{−1/γ}], but with γ < 0. Figure 1 summarizes this categorization according to tail heaviness.
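The threshold-exceedance limit law can be checked numerically by fitting a generalized Pareto distribution to the exceedances of a high quantile (a sketch assuming scipy is available; the choice of the 99th percentile as threshold is arbitrary):

```python
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(1)
x = rng.uniform(size=100_000) ** (-1.0 / 2.0)   # Pareto sample, gamma = 1/alpha = 0.5
l = np.quantile(x, 0.99)                        # a "sufficiently large" threshold
exceedances = x[x > l] - l                      # the variable X - l | X > l

# scipy's shape parameter c corresponds to the extreme value index gamma;
# the location is fixed at zero since the exceedances start at the threshold
c, loc, scale = genpareto.fit(exceedances, floc=0.0)
```

For this sample, the fitted shape parameter should be close to the true extreme value index γ = 0.5.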
In what follows, we will deal with the extreme value index γ instead of the tail exponent α = γ −1 . This has two practical reasons: firstly, γ is the quantity that is used in all mathematical publications, so the quite extensive mathematical theory is based on that quantity; and secondly, in the context of supercontinuum generation, γ is the quantity proportional to the mean intensity of the pump, so it is in a way more handy than α.

Basic tools for investigating power-law tails
The purpose of this section is to show physicists some simple, visual tools for assessing power-law behavior and also give references on improving the behavior of the estimators. The reason why we have not picked a single favorite estimator is that in order to have a reliable assessment of power-law behavior, it is better to use more than one tool and see whether they produce consistent results.
Examples: For demonstration purposes we will use in the next sections computer-generated samples of different distributions; their properties are summarized in Table 1. The size of the generated samples was 10^4. We did not use solely regularly-varying distributions (2), because we would also like to show what the tools produce when used with distributions that do not have a power-law tail. The exponential distribution is a standard thin-tailed distribution, relative to which the heavy-tailed property itself is defined. The log-normal distribution is, according to definition (1), heavy-tailed, but it does not have a finite tail exponent. A log-gamma distributed variable can be created by exponentiating a gamma-distributed variable the same way one transforms a normally distributed variable into a log-normal; it is an example of a regularly-varying distribution with L(x) = const. The Pareto distribution corresponds to pure power-law behavior, and is used for demonstrating the best-case scenario (it is also the a = 1 special case of the log-gamma distribution).
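The demonstration samples can be generated along the following lines (a sketch; the parameter values are illustrative and not necessarily those of Table 1):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 10_000  # sample size used throughout the demonstrations

samples = {
    "exponential": rng.exponential(scale=1.0, size=N),           # thin-tailed reference
    "log-normal":  rng.lognormal(mean=0.0, sigma=1.0, size=N),   # heavy-tailed, gamma = 0
    # exponentiated gamma variable: regularly varying with alpha = 1/scale
    "log-gamma":   np.exp(rng.gamma(shape=2.0, scale=1.0, size=N)),
    # inverse-transform sampling of a pure power law with alpha = 1
    "pareto":      rng.uniform(size=N) ** (-1.0),
}
```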

Histogram
Preparing a histogram is probably the most widespread way to visualize random samples, and is definitely the go-to tool for many physicists. This is why we devote more attention to it in this paper than it deserves from the mathematical point of view. It involves defining a discrete set of bins over the number line and then counting how many observations of the random variable fall into each bin. Given a linear set of bins, the histogram provides an estimate of the PDF of the underlying distribution. If this distribution is, at least approximately, Pareto, then the histogram should be linear on a log-log plot, with the absolute value of the slope equal to the tail exponent plus one. One can even perform a least-squares fit to obtain the slope of the line and thereby estimate the exponent. There are, however, three major problems with this approach:
1. Due to the power-law nature, there will be only a few observations (if any) at the right end of a linear set of bins, meaning that exactly for large values, where the power-law property itself should be most pronounced, there will be considerable variance.
2. As taking the logarithm and the expectation are not interchangeable, the expected value of the logarithm of the frequencies is not equal to the logarithm of the PDF. Furthermore, the variance of the frequencies depends on the location. So the basic prerequisites of an ordinary least squares fit do not hold.
3. The choice of bins has a marked effect on the outcome.
To help with the first problem, one can use logarithmic bins instead of linear ones. However, it is important to be aware of the fact that using logarithmic bins, that is, bin sizes that are linear on a logarithmic scale, is equivalent to preparing a linearly binned histogram of Y = ln X. Consequently, this version of the histogram approximates the probability density function of Y, g(y). This change of variables results in g(ln x) = x · f(x), meaning that the slope of the linearly binned histogram is α + 1, while the slope of the logarithmically binned version is α instead, see figure 2(a).
Concerning the second point, if one does not take the logarithm of the frequencies, there is no problem with expectations, and the variances can be calculated explicitly. A consistent version of the histogram approach is preparing the histogram of Y = ln X and performing a weighted least-squares fit of A · e^{−By}. The weights are needed to make the data homoscedastic [67], and can be calculated as [p̂_k (1 − p̂_k)]^{−1}, with p̂_k denoting the fraction of observations within the k-th bin. The choice of bins in this setting is especially problematic, since empty bins should be avoided (or, alternatively, discarded when performing the fit).
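A minimal sketch of the improved estimator (the function name, the Poisson-type approximation of the weights, and the parameter values are our choices, not the paper's):

```python
import numpy as np
from scipy.optimize import curve_fit

def histogram_estimator(x, k, n_bins=5):
    """Estimate alpha from the top-k observations: logarithmic binning
    (linear bins in y = ln x) plus a weighted least-squares fit of
    A * exp(-B * y) to the empirical density of y."""
    y = np.log(np.sort(x)[-k:])
    counts, edges = np.histogram(y, bins=n_bins)
    widths = np.diff(edges)
    centers = 0.5 * (edges[:-1] + edges[1:])
    density = counts / (k * widths)
    keep = counts > 0                     # discard empty bins before fitting
    # inverse-variance weights, approximating the bin counts as Poisson
    sigma = np.sqrt(counts[keep]) / (k * widths[keep])
    popt, _ = curve_fit(lambda t, A, B: A * np.exp(-B * t),
                        centers[keep], density[keep],
                        p0=(1.0, 1.0), sigma=sigma)
    return popt[1]                        # the decay rate B estimates alpha

rng = np.random.default_rng(3)
x = rng.uniform(size=10_000) ** (-1.0 / 2.0)     # pure Pareto, alpha = 2
alpha_hat = histogram_estimator(x, k=10_000)
```

Note that with equal-width bins in y, evaluating the exponential at the bin centers biases only the amplitude A, not the decay rate B.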
The third issue cannot be completely eliminated; however, figure 2(b) shows the results of a small-scale simulation experiment based on purely power-law samples (corresponding to the best-case scenario). We have taken N_samples = 10^4 samples, each consisting of N = 10^4 elements. For each sample, for k = 1, 2, . . . , N we have calculated the values of the estimators based on the top k observations and compared them to the known value of the extreme value index γ to obtain the relative root mean square error

RMSE_m(k) = ⟨(γ̂_m(k) − γ)²⟩^{1/2} / γ,

with m denoting the method used. This means that for a given value of k, we prepared 4 histograms (2, 3, 5, and 9 bins), and calculated the estimate using both the naive (least-squares linear fit on log-log scale) and the improved version (weighted least-squares exponential fit on lin-log scale) of the estimator for each. First of all, the simulation showed how bad the performance of the naive histogram estimator really is. Secondly, and perhaps a little counter-intuitively, it showed how few bins are required to minimize the error of either version of the histogram-based estimator, compared to how many bins one would use for visualization. Considering, however, that one only needs to estimate a single parameter, increasing the number of points does not necessarily help if these in turn become less accurate. To get an estimate of how many bins should be used for a given number of observations k, one can consider the Rényi representation theorem [68]. Using it, it is straightforward to show that the expected log-spacing between the largest and smallest elements of a purely Pareto-distributed sample of size k is α^{−1} · [1 + 1/2 + . . . + 1/(k − 1)] ≈ α^{−1} · [ln(k − 1) + γ_E], with γ_E denoting the Euler-Mascheroni constant. Furthermore, supposing that there is an ideal (log) bin width that minimizes the RMSE of a histogram estimator independently of the sample size, it has to be proportional to α^{−1} (since that is the scale parameter of the logarithmically transformed sample), so the total number of bins should be a linear function of ln(k − 1). Based on our simulations, for k = 100, 3 bins are ideal, while for k = 10^4, 5 bins produce the best results using the improved version of the histogram estimator (see figure 2(b)). Interestingly, for the naive version it seems to be always better to prepare just 2 bins. All in all, the histogram need not be completely discarded as a tool even when working with heavy-tailed distributions, provided that certain changes are applied to the naive estimator: use logarithmic binning, use only a few bins (if the histogram resembles a broomstick like in figure 2(a), reduce the number of bins), and use weighted least squares to fit an exponential to the logarithmically transformed data. Nevertheless, the histogram remains first and foremost a visual tool in the heavy-tailed context.

Empirical tail function and QQ estimator
The empirical tail function (ETF) presents a simple alternative to histograms as a visual tool that does not depend on an arbitrary binning procedure. Given a sample of independent, identically distributed observations, sorted in decreasing order as x_(1) ≥ x_(2) ≥ . . . ≥ x_(N), the empirical tail function is given as

F̂(x_(k)) = k/N,    (3)

or equivalently, for an arbitrary threshold l ∈ R,

F̂(l) = (1/N) Σ_{i=1}^{N} 1{x_i ≥ l},

with 1{·} denoting the indicator function. In other words, one has to check what proportion of the observations exceed the limit l; l = x_(k) yields the first definition, equation (3). If the sample is power-law distributed (at least for the largest observations), the ETF should be linear on a log-log plot (again, for the largest observations), with the slope equal to −α (see figure 3). Since the TF is the integral of the PDF, the ETF is considerably smoother than the histogram, which makes it easier to detect deviations from power-law behavior visually.
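Computing the ETF requires only sorting (a minimal sketch; the slope check at the end uses a plain least-squares fit, which for a pure Pareto sample should recover −α):

```python
import numpy as np

def empirical_tail_function(x):
    """Return the points (x_(k), k/N) of the empirical tail function,
    with x_(k) the k-th largest observation."""
    xs = np.sort(x)[::-1]                  # decreasing order
    k = np.arange(1, len(xs) + 1)
    return xs, k / len(xs)

rng = np.random.default_rng(4)
x = rng.uniform(size=10_000) ** (-1.0 / 2.0)   # pure Pareto, alpha = 2
xs, etf = empirical_tail_function(x)
# on a log-log plot the ETF should be a line of slope -alpha:
slope = np.polyfit(np.log(xs), np.log(etf), 1)[0]
```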
As for a numerical estimate of α, it is, of course, possible to perform a least-squares fit on the points (ln x_(k), ln(k/N)) (and that also yields much better results than the naive histogram approach); however, it is better to do this with the roles reversed, that is, by treating ln(k/N) as the independent variable. The latter procedure is referred to as the QQ estimator [69], with QQ standing for quantile-quantile. The essential difference is that in the ETF version, one divides by the sample variance ⟨ln² x_(k)⟩ − ⟨ln x_(k)⟩², and in the QQ version by a deterministic number, ⟨ln²(k/N)⟩ − ⟨ln(k/N)⟩². The QQ version is therefore more stable.
In general, a QQ plot is constructed by plotting the sorted observations against the matching quantiles of the theoretical distribution. If the underlying distribution matches the theoretical distribution (up to a linear transformation), then the QQ plot should be linear. In the specific case of a Pareto-distributed variable, one transforms it into an exponentially distributed variable by taking the logarithm and plots the sample as a function of the standard exponential quantiles; that is, the plot consists of the points (−ln(k/N), ln x_(k)) and the slope of the line is α^{−1}. In essence, this corresponds to exchanging the two axes of the ETF plot on a log-log scale. The QQ estimator of γ = α^{−1} is then calculated as the slope of the LS fit on the QQ plot. This is a very simple estimator, and it has been shown to be weakly consistent [70]. However, the issues present in the naive histogram estimator are present here, too. Namely, the expected value of ln X_(k) is not −α^{−1} · ln(k/N), and the variance of ln X_(k) depends on k. What is more, X_(i) and X_(j) are not independent. The problem has been addressed in [47] for purely Pareto-distributed samples, taking into account both the issue of expectations and the covariance matrix of the order statistics. The authors show that the solution of the generalized regression problem is equivalent to the Hill estimator discussed in section 3.3, which should not come as a surprise since it is the maximum likelihood estimator. The same issue has been addressed in [71], under more general assumptions on the underlying distribution, yielding an improved regression estimator. Nevertheless, even the basic QQ estimator is a viable, although usually inferior, alternative to the Hill estimator.
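A sketch of the QQ estimator (the function name is ours; for a pure Pareto sample the slope recovers γ = α^{−1}):

```python
import numpy as np

def qq_estimator(x, k):
    """QQ estimator of gamma = 1/alpha: slope of the least-squares line
    through the points (-ln(j/N), ln x_(j)), j = 1..k."""
    xs = np.sort(x)[::-1]
    N, j = len(xs), np.arange(1, k + 1)
    return np.polyfit(-np.log(j / N), np.log(xs[:k]), 1)[0]

rng = np.random.default_rng(5)
x = rng.uniform(size=10_000) ** (-1.0 / 2.0)   # pure Pareto, gamma = 0.5
gamma_hat = qq_estimator(x, k=5_000)
```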

Hill plot
The Hill estimator can be obtained as the conditional maximum likelihood estimator of the reciprocal of the tail exponent α^{−1} for Pareto-distributed data [72]:

γ̂_H(k) = (1/k) Σ_{i=1}^{k} ln x_(i) − ln x_(k+1),    (4)

with x_(i), i = 1, 2, . . . , N denoting the i-th largest element of the sample. In a more general setting, given a sufficiently large number of observations from a regularly-varying distribution, it converges to the extreme value index of the distribution (the rate of convergence depends on the distribution) [73]. For pure Pareto(α) data, E[γ̂_H(k)] = α^{−1} and RMSE[γ̂_H(k)] = α^{−1}/√k. If a random variable has a power-law distribution, taking its logarithm transforms it into an exponentially distributed random variable, so another way of interpreting the Hill estimator is the following: it first transforms the power-law sample into an exponential one, and then estimates the parameter of this exponential by taking the sample average, using the fact that for X ∼ EXP(λ), E(X|X > l) = l + λ^{−1} for any l ∈ R₊, which is sometimes referred to as the ageless or memoryless property of the exponential distribution. A third interpretation was given in the context of generalized regression in [47].
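The Hill estimator is essentially a one-liner (a sketch; parameter values are illustrative):

```python
import numpy as np

def hill_estimator(x, k):
    """Hill estimator of gamma = 1/alpha from the top-k observations:
    the mean of ln x_(1..k) minus ln x_(k+1)."""
    xs = np.sort(x)[::-1]
    return np.log(xs[:k]).mean() - np.log(xs[k])

rng = np.random.default_rng(6)
x = rng.uniform(size=10_000) ** (-1.0 / 2.0)   # pure Pareto, gamma = 0.5
gamma_hat = hill_estimator(x, k=1_000)         # RMSE ~ gamma / sqrt(k) here
```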
The term Hill plot stands for plotting γ̂_H as a function of the tail length k. If we know that the data is exactly power-law, it is of course best to set k = N − 1, so the Hill plot does not make much sense. For real-life data, however, it is usually only the largest observations for which power-law behavior is a reasonable approximation. Consequently, one would normally look for a plateau in the Hill plot where the estimate is relatively stable, and choose the number of points to take into account based on that. Drees et al. [74] have shown that it is best practice to plot the Hill estimator as a function of ln k instead of k, for which they coined the phrase altHill plot (figure 4). Taking a closer look, for a general regularly-varying distribution (2), the presence of the function L(x) complicates the situation in two ways:
• It introduces a non-zero bias which vanishes only asymptotically. How fast it vanishes depends on the nature of the corrections to the power-law behavior.
• The optimal tail length k opt which minimizes the mean squared error of the estimator becomes (often considerably) smaller than the sample size.
See for example figure 4(c), showing the Hill plots for log-gamma distributed samples, which have a logarithmic correction to the power-law. Having information about L(x) is therefore very useful: one can modify the Hill estimator to reduce its bias, and also estimate the optimal tail length k opt . The statistics literature contains several approaches to bias reduction as well as the choice of tail length (see section 4.4 or [59] and references therein), but in general, the more information one has about the higher-order behavior of the distribution the better, as estimating higher-order parameters is of course even more problematic than estimating the extreme value index γ itself.
Another issue with the Hill estimator is that it always produces a positive number, regardless of the underlying distribution; see for example figure 4(b), where the true extreme value index γ is zero for the log-normal distribution. Generally, the closer the estimate γ̂ is to zero, the less confident one should be in the results, even if the histogram seems straight at the end. As a rule of thumb, if one has an estimate γ̂ < 0.3, it is a good idea to consider alternative models for the data (unless the theoretical background is clear and indicates power-law behavior without reason for doubt).
There are further basic tools, like the mean excess plot [75,76], which is quite popular in the actuarial field; these are, however, not as well suited for estimating the value of the tail exponent.

Parameter estimation for intensity measurements
The previous section showed that in the ideal case, each approach yields results consistent with the true tail exponent. Actual experiments, however, are never that simple; the distortions and noise introduced by the apparatus require further considerations. There are two major effects that cannot be escaped: detector imperfections (limited linear response) and noise added by different experimental elements. We will use a very simplistic model of the experiments done in [41] in order to show how these affect estimation.

Models used in simulations
Manceau et al. [41] showed that for supercontinuum generation setups, the distribution of measured intensities has very heavy tails.  According to Equation (1), each experimental setup produces heavy-tailed observables, even though the pumps are not heavy-tailed (the gamma distribution also decays asymptotically at an exponential rate). Note that for optical harmonic generation, even though the decay is slower than exponential, each moment exists, meaning that there is no theoretical issue with calculating second (or higher) order correlations like g (2) [32], so we are not going to further concern ourselves with that case. For the supercontinuum setup, however, the situation is quite different: depending on the value of λ (or β), which is inversely proportional to the pump's mean intensity, the highest existing moment can be arbitrarily small.
If the models in Table 2 were not only approximations of reality, one would only need to invert the non-linear transformation involved in order to obtain a vanilla exponential or gamma sample and quite simply estimate the single free parameter of those distributions. Experimental imperfections do, however, complicate the situation. For illustration, let us consider the following simple model of the supercontinuum-generation process:

X_0 ≡ K · sinh²(I),

with I denoting the pump's intensity (whose distribution is either EXP(λ) or GAMMA(1/2, b)) and X the detector's output. The detector is modeled through

X = min{max(K · sinh²[I + ω_1], l), L} + ω_2,

with ω_1, ω_2 denoting independent Gaussian noises on the pump's and the detector's side, respectively (these can be aggregates of different types of noise); K is a constant factor. The value of l corresponds to the detector's noise floor, the value of L to its saturation limit. The effects of the different elements of the model are shown in figure 5(a). The dash-dotted light blue line shows the ideal case, i.e., the TF of X_0 ≡ K · sinh²(I), which is distorted in the following steps: Adding a Gaussian noise prior to the non-linear transformation (X_1 ≡ K · sinh²[I + ω_1]) corresponds asymptotically only to a multiplication by a constant factor exp(λσ_1²) (see the dark blue dash-dotted line). Introducing a lower cutoff l introduces a discontinuity in the TF, which is smeared by the second Gaussian noise ω_2 (X_2 ≡ max{K · sinh²[I + ω_1], l} + ω_2). The upper cutoff at L introduces a sharp fall of the TF to zero at L, which could be smoothed by taking a more accurate model of the saturation curve. However, for the sake of simplicity, the final step is just X = min{max(K · sinh²[I + ω_1], l), L} + ω_2. For the specific parameters used in figure 5, detector saturation essentially corresponds to discarding about the top 1% of observations, shown in the figure in light red.
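The model can be simulated directly for the thermal pump (a sketch; all parameter values are illustrative and the function name is ours):

```python
import numpy as np

def simulate_detector(n, lam=1.0, K=1.0, sigma1=0.1, sigma2=0.01,
                      floor=1e-2, sat=1e6, seed=0):
    """Simulate X = min(max(K * sinh(I + w1)**2, floor), sat) + w2 with a
    thermal pump I ~ EXP(lam) and independent Gaussian noises w1, w2."""
    rng = np.random.default_rng(seed)
    I = rng.exponential(scale=1.0 / lam, size=n)
    w1 = rng.normal(0.0, sigma1, size=n)
    w2 = rng.normal(0.0, sigma2, size=n)
    x0 = K * np.sinh(I + w1) ** 2            # ideal non-linear response
    return np.minimum(np.maximum(x0, floor), sat) + w2

x = simulate_detector(10_000)
```

Since sinh²(I) ≈ e^{2I}/4 for large I, a thermal pump with rate λ produces a tail exponent of λ/2 before the cutoffs distort the tail.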
Note that out of the parameters of the model, only the pump's mean intensity has an impact on the tail exponent (if there was a multiplicative distortion prior to the non-linear transformation, the situation would be different). Figure 5(b) illustrates how much easier it is to notice that the saturation limit of the detector was reached during the experiment if one uses the ETF instead of the histogram to visualize the data: while there is a very visible cut-off at the saturation limit for the former, the latter shows only a slight increase in the rightmost bin. Note that in this model, there is a non-zero probability of negative observations, which is why the ETF does not go to one as x goes to zero.
In the following sections, we will examine how the different estimation tools perform for the different versions of our model.

Numerical results for supercontinuum generation
Using figure 5(a), we have gained insight into what the individual parameters of our model do. We now look at the performance of the three basic estimators discussed in section 3, and how they are affected by the model parameters. Figure 6 shows the root mean square error of the different estimators for the thermal case, calculated using 10³ simulated samples of size 10⁴. Clearly, in this case the convergence to the power law is quick, since the ideal RMSE (black dashed line) is easily reached by the Hill estimator, with the other two estimators performing only a little worse (the minimum is about 4% instead of 3%). This is easily understood when examining the asymptotic form of the TF (which has a closed analytic form for σ_2 = 0), which shows that the convergence to the asymptotic behavior is also power-law. The divergence in the bias is caused by the fact that observations for this model can get arbitrarily close to zero (and can even be negative), so the term −ln x_(k+1) in (4) will dominate the Hill estimator as k is increased. This divergence also introduces meaningless minima in the RMSE curve at k ≈ 8·10³, which is where the bias curves cross zero before diverging. Figure 7 shows that varying the mean intensity of the source (λ⁻¹) and the noise floor (l) changes the situation compared to figure 6. The best achievable error is essentially determined by how many observations are unaffected by experimental imperfections. If the mean intensity is increased (figure 7(a)), more observations are above the noise floor of the detector, which ultimately means that one has to discard fewer observations when estimating the tail index. Note that the blue line in figure 7(a) corresponds to considerable pre-transformation noise: the typical scale of the signal (λ⁻¹ = 1/2) is exceeded by the scale of the noise (σ_1 = 1), resulting in an oddly shaped profile.
The same is true for the red line in figure 7(b): even though the noise floor is low (l = 10²), it is exceeded by the scale of the additive noise ω_2 (σ_2 = 10³).
If the process is pumped by bright squeezed vacuum, the problem becomes more technically involved. As figure 8 shows, the best achievable RMSE is significantly worse (10–20% instead of 3–4%), owing to the considerable bias of the estimators in this case. This is because the correction to power-law behavior decays as a power of ln x, and not of x. It is therefore helpful if, after taking the logarithm of the observations, one uses the conditional maximum likelihood estimator for GAMMA(1/2, β) instead of the Hill estimator, which, as discussed before, is a conditional MLE for EXP(λ). That is, one should maximize the function (8), in which Γ(s, z) denotes the upper incomplete gamma function. It is straightforward to show that, asymptotically, the value of β^−1 that maximizes (8) differs from the true value only by terms of order O(1/ln x_(k+1)). The maximization can be done numerically with α = 1/2 held fixed, using β_0 = γ_H^−1 as the starting point. As the dashed lines in figure 8 show, the estimator defined in (8) is indeed more efficient than the simple Hill estimator. The improvement is not as significant as one might hope for, since the actual transformation applied to the input signal was K · sinh^2(·) and not exp(·). Thus, in theory, to get better results one should change ln x_(i) to asinh √(x_(i)/K′) in (8), with K′ = K · exp(σ_1^2 β); however, this is problematic, since K′ depends on the unknown values of β (≡ 2/γ), σ_1, and K.
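Since (8) is not reproduced here, the following sketch illustrates the same idea in a simplified setting: a conditional (threshold-exceedance) maximum likelihood fit of the scale β of GAMMA(1/2, β) to the log-observations above a threshold, maximized numerically from a starting point β_0. The function name and the use of SciPy are our assumptions, not the paper's implementation:

```python
import numpy as np
from scipy.stats import gamma
from scipy.optimize import minimize_scalar

def truncated_gamma_scale_mle(y, t, beta0):
    """Conditional MLE of beta for GAMMA(1/2, beta), given observations y > t.

    y     : log-observations exceeding the threshold t
    beta0 : starting guess (e.g. derived from the Hill estimate)
    """
    def negloglik(beta):
        # conditional density of an exceedance: pdf / survival-at-threshold
        return -(gamma.logpdf(y, a=0.5, scale=beta).sum()
                 - len(y) * np.log(gamma.sf(t, a=0.5, scale=beta)))
    res = minimize_scalar(negloglik, bounds=(0.1 * beta0, 10 * beta0),
                          method="bounded")
    return res.x
```

The shape parameter is held fixed at α = 1/2 and only the scale is optimized, mirroring the one-dimensional maximization described above.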

Detector saturation
If the observations during an experiment are distorted by detector saturation, one can try to recalibrate the equipment so that all observations fall within the linear range of the detector. However, this is not necessarily trivial if the process is indeed heavy-tailed; and even when recalibration is possible, it is still good practice to evaluate the original data set instead of throwing it away. This, however, requires modifying the basic estimators introduced in section 3, which were constructed under the assumption that the largest observations are (close to being) Pareto distributed. Figure 9(a) shows that, as expected, the basic approach fails if there is detector saturation.
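A quick way to see whether saturation is present is to inspect the empirical tail function: saturation produces a cluster of tied values at the top of the sample, i.e. a vertical segment at the saturation level on a log-log plot. A minimal sketch (the helper name is ours):

```python
import numpy as np

def empirical_tail(x):
    # Empirical survival function: P(X > x_(i)) estimated as i/N,
    # with x_(i) the i-th largest observation (descending order).
    xs = np.sort(x)[::-1]
    return xs, np.arange(1, len(xs) + 1) / len(xs)
```

Plotting the returned pairs on doubly logarithmic axes shows a straight line for a Pareto tail; a pile-up of identical values at the largest observation signals saturation and tells us how many points to discard.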
The three basic techniques presented in section 3 are easily modified to discard the largest observations, which are affected by detector saturation. For the least squares approaches on either the histogram or the QQ plot, one simply has to omit observations above a certain threshold; otherwise the optimization is exactly the same as without the upper cutoff.
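For the QQ-plot variant, the modification can be sketched as follows: the least squares slope of ln x_(i) against ln((k+1)/i) is computed over the ranks i = j+1, …, k only, with the j largest (saturated) observations omitted. This is an illustrative implementation, not the authors' code:

```python
import numpy as np

def qq_tail_index(x, k, j=0):
    # Least squares slope of ln x_(i) versus ln((k+1)/i) over ranks
    # i = j+1, ..., k; for a Pareto tail this slope estimates gamma.
    z = np.sort(np.log(x))[::-1]          # descending log-observations
    i = np.arange(j + 1, k + 1)
    u = np.log((k + 1) / i)               # theoretical exponential quantiles
    slope, _ = np.polyfit(u, z[i - 1], 1)
    return slope
```

Setting j = 0 recovers the usual QQ estimator; a positive j simply drops the saturated upper end of the plot without otherwise changing the fit.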
For generalizing the Hill estimator, one has to take advantage of the Rényi representation theorem, according to which the scaled spacings of the order statistics of an exponential sample are themselves exponentially distributed: that is, if Z_i, i = 1, . . . , N is an i.i.d. EXP(λ) sample, then S_(i) ≡ i · (Z_(i) − Z_(i+1)), i = 1, . . . , N − 1 are also i.i.d. EXP(λ), with Z_(i) denoting the i-th largest observation in the sample. Knowing that if X is Pareto, then ln X is exponential, discarding the j largest observations results in the estimator (9). Note that setting j = 0 yields the standard Hill estimator, as expected. As figure 9(b) shows, our suggested generalization indeed improves the performance significantly. When choosing the value of j, one has to, of course, discard the values affected by detector saturation. However, discarding too many values increases the RMSE, which is, in the best-case scenario, proportional to 1/√(k − j). If, in order to reduce the bias, one chooses to keep only the measurements within a shorter interval [x_LO, x′_HI) with x′_HI < x_HI, the number of measurements has to be multiplied by a factor of F(…), which ensures that, on average, the same number of observations is used.

Figures 6–9 show the best attainable performance of the basic estimators. The problem in practice is deciding how many points to take into account when calculating the estimators (k), especially if there is a large number of samples and evaluating each one is not feasible without automation. If the higher-order behavior is known, or can be estimated with reasonable accuracy, the optimal tail length (minimizing the RMSE) can be estimated as well. This is the approach followed by most of the literature; however, due to the difficulty of estimating higher-order behavior, the resulting algorithms can be quite involved (requiring the tuning of nuisance parameters) and often do not outperform simpler, heuristic approaches.
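Although (9) is not reproduced here, the construction described above (averaging the scaled log-spacings i · (ln x_(i) − ln x_(i+1)) over the ranks i = j+1, …, k) can be sketched as follows; by the Rényi representation these spacings are i.i.d. exponential for a Pareto tail, and j = 0 recovers the Hill estimator. The function name is hypothetical:

```python
import numpy as np

def generalized_hill(x, k, j=0):
    # Hill-type estimator using the k largest observations while
    # discarding the j largest (e.g. those affected by saturation):
    # average of the scaled spacings i*(ln x_(i) - ln x_(i+1)),
    # i = j+1, ..., k, which are i.i.d. EXP(1/gamma) for a Pareto tail.
    z = np.sort(np.log(x))[::-1]     # descending: z[0] is the largest
    i = np.arange(j + 1, k + 1)
    spacings = i * (z[i - 1] - z[i])
    return spacings.mean()
```

Since only the ranks above j enter, tied (clipped) values at the saturation level no longer bias the estimate, at the cost of the 1/√(k − j) increase in RMSE noted above.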
Table 3: Empirical RMSE of different estimators applied to 100 samples of size 10^4, for different types of distributions, all of which had a positive finite extreme value index (γ ≡ α^−1 ∈ {2, 1}). The "Cost" column refers to CPU time and complexity. The "Model" column refers to the toy model with a thermal pump and (λ = γ/2, σ_1 = 0, l = σ_2 = 10^3, L = ∞). The best three values for each distribution are indicated in different shades of green.

Choosing tail length
We have implemented the estimation procedures shown in Table 3. Procedures 1–3 are our proposals, in which a heuristic path stability approach is used to choose the tail length, based mainly on [81] and tweaked using [74] and [26] (for details, see appendix A). This approach was included because it essentially emulates how one would choose a tail length from the Hill plot, and it works with an arbitrary estimator. Table 3 is, of course, not an exhaustive study of these procedures, especially since procedures 6–10 involve more than one nuisance parameter, which could have been used to fine-tune them. We did not do that; instead, we used the default parameter values suggested in the sources to see how the procedures perform "out of the box". Based on our tests, the procedure introduced by Guillou et al. [77] provided the most reliable results and had the further advantages of simplicity and speed. Our suggested path stability approach, which mimics visual evaluation, also works reasonably well when combined with the Hill estimator.

Conclusion and Discussion
In this work, we first gave a brief overview of a basic toolkit that may be used to estimate the tail exponent associated with a heavy-tailed sample. We devoted more attention to histograms than is justified from the mathematical point of view, as they are the default tool for many physicists, and discussed how to use them more efficiently. Subsequently, we discussed two simple alternatives through examples: the QQ estimator and the Hill estimator.
We then addressed the challenges specific to intensity measurements in supercontinuum generation experiments. To do so, we introduced a model for the observable intensity distribution. Firstly, if the source in the experiment is bright squeezed vacuum, the estimation becomes considerably less efficient than in the thermal case. We suggest using a modified version of the Hill estimator, defined in (8), for BSV; if the pump is thermal, the plain Hill estimator (4) is sufficient. Next, one should check whether the observations were affected by detector saturation, which is easily done by preparing the empirical tail function of the sample. If they were, one has to discard the affected observations and apply the estimator introduced in (9). Finally, we also included a comparison of procedures that choose how many observations to take into account in an automated fashion.

Figure 10: Heuristic tail length selection algorithm, with estimates plotted against tail length (the plot distinguishes rejected estimates, non-rejected estimates, and rounded values). The thick black line corresponds to K, the green interval to I. The sample was created by applying sinh^2(·) to an exponential sample and adding Gaussian noise. The true value of γ for this sample is one.
4. Identify the longest interval I within which the rounded estimates are constant. The original method suggests defining the length as the number of elements (k_2 − k_1 + 1); following Drees et al. [74], we suggest defining the length on the log scale (log(k_2/k_1)).

5. Within I, calculate the estimates rounded to n + 2 digits, and identify the mode of these values, denoted by m.
6. The tail estimate is then obtained by choosing the shortest tail k̂ within I for which γ̂(k) rounded to n + 2 decimal places is equal to m. The final estimate of γ is then γ̂(k̂).
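Steps 4–6 above can be sketched as follows, assuming the estimate path γ̂(k) has already been computed and rounded to n digits (steps 1–3, detailed in the appendix, are not reproduced here); all names are illustrative:

```python
import numpy as np

def path_stability_tail(gamma_hat, n=1):
    # gamma_hat[k-1]: estimate using tail length k (k = 1, ..., K)
    rounded = np.round(gamma_hat, n)
    # Step 4: longest run of constant rounded estimates, with the run
    # length measured on the log scale, log(k2/k1), per Drees et al.
    best, start = None, 0
    for i in range(1, len(rounded) + 1):
        if i == len(rounded) or rounded[i] != rounded[start]:
            k1, k2 = start + 1, i
            length = np.log(k2 / k1)
            if best is None or length > best[0]:
                best = (length, k1, k2)
            start = i
    _, k1, k2 = best
    # Step 5: within I = [k1, k2], round to n + 2 digits and take the mode m.
    fine = np.round(gamma_hat[k1 - 1:k2], n + 2)
    vals, counts = np.unique(fine, return_counts=True)
    m = vals[np.argmax(counts)]
    # Step 6: shortest tail in I whose (n+2)-digit estimate equals m.
    khat = k1 + int(np.argmax(fine == m))
    return gamma_hat[khat - 1]
```

The idea is simply that a stable stretch of the Hill plot, measured on the log scale so that short tails are not unduly favored, identifies the tail lengths over which the estimate can be trusted.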