The perils of thresholding

The thresholding of time series of activity or intensity is frequently used to define and differentiate events. This is either implicit, for example due to resolution limits, or explicit, in order to filter certain small scale physics from the supposed true asymptotic events. Thresholding the birth-death process, however, introduces a scaling region into the event size distribution, which is characterised by an exponent that is unrelated to the actual asymptote and is rather an artefact of thresholding. As a result, numerical fits of simulation data produce a range of exponents, with the true asymptote visible only in the tail of the distribution. This tail is increasingly difficult to sample as the threshold is increased. In the present case, the exponents and the spurious nature of the scaling region can be determined analytically, thus demonstrating the way in which thresholding conceals the true asymptote. The analysis also suggests a procedure for detecting the influence of the threshold by means of a data collapse involving the threshold-imposed scale.


I. INTRODUCTION
Thresholding is a procedure applied to (experimental) data either deliberately, or effectively because of device limitations. The threshold may define the onset of an event and/or an effective zero, such that below the threshold the signal is regarded as 0. An example of thresholding is shown in Fig. 1. Experimental data often comes with a detection threshold that cannot be avoided, either because the device is insensitive below a certain signal level, or because the signal cannot be distinguished from noise. The quality of a measurement process is often quantified by the noise to signal ratio, with the implication that high levels of noise lead to poor (resolution of the) data. Often, the rationale behind thresholding is to weed out small events which are assumed irrelevant on large scales, thereby retaining only the asymptotically big events which are expected to reveal (possibly universal) large-scale physics.
Most, if not all, of physics is due to some basic interactions that occur on a "microscopic length scale", say the interaction between water droplets or the van der Waals forces between individual water molecules. These length scales separate different realms of physics, such as between micro-fluidics and molecular physics or between molecular physics and atomic physics. However, these are not examples of the thresholds we are concerned with in the following. Rather, we are interested in an often arbitrary microscopic length scale well above the scale of the microscopic physics that governs the phenomenon we are studying, such as the spatiotemporal resolution of a radar observing precipitation (which is much coarser than the scale set by microfluidics), or the resolution of the magnetometer observing solar flares, (which is much coarser than the scale set by atomic physics and plasma magnetohydrodynamics). Such thresholds often come down to the device limitations of the measuring apparatus, the storage facili-ties connected to it, or the bandwidth available to transmit the data. For example, the earthquake catalogue of Southern California is only complete above magnitude 3, even though the detection-threshold is around magnitude 2 [1]. One fundamental problem is the noise-to-signal ratio mentioned above. Even if devices were to improve to the level where the effect of noise can be disregarded, thresholding may still be an integral part of the measurement. For example, the distinction between rainfall and individual drops requires a separation of microscale and macroscale which can be highly inhomogeneous [2]. Solar flares, meanwhile, are defined to start when the solar activity exceeds the threshold and end when it drops below, but the underlying solar activity never actually ceases [3].
Thresholding has also played an important rôle in theoretical models, such as the Bak-Sneppen Model [4] of Self-Organised Criticality [5], where the scaling of the event-size distribution is a function of the threshold [6] whose precise value was the subject of much debate [7,8]. Finite size effects compete with the threshold-imposed scale, which has been used in some models to exploit correlations and predict extreme events [9].
Often, thresholding is tacitly assumed to be "harmless" for the (asymptotic) observables of interest and beneficial for the numerical analysis. We will argue in the following that this assumption may be unfounded: the very act of thresholding can distort the data and the observables derived from it. To demonstrate this, we will present an example of the effect of thresholding by determining the apparent scaling exponents of a simple stochastic process, the birth-death process (BDP). We will show that thresholding obscures the asymptotic scaling region by introducing an additional prior scaling region, solely as an artefact. Owing to the simplicity of the process, we can calculate the exponents, leading order amplitudes and the crossover behaviour analytically, in excellent agreement with simulations. In doing so, we highlight the importance of sample size since, for small samples (such as might be accessible experimentally), only the "spurious" threshold-induced scaling region that governs the process at small scales may be accessible. Finally, we discuss the consequences of our findings for experimental data analysis, where detailed knowledge of the underlying process may not be available, usually the mechanism behind the process of interest is unclear, and hence such a detailed analysis is not feasible. But by attempting a data collapse onto a scaling ansatz that includes the threshold-induced scale, we indicate how the effects of thresholding can be revealed.
The outline of the paper is as follows: In Sec. II we introduce the model and the thresholding applied to it.
To illustrate the problems that occur when thresholding real data, we analyse in detail some numerical data. The artefact discovered in this analysis finds explanation in the theory present in Sec. III. We discuss these findings and suggest ways to detect the problem in the final section.   In this case, two power laws can be fitted in two different regimes: below g X = 8πh, we findγ1 = 1.50070 (2) in the (fixed) range [10 −2 , 10 3 ], while above g X , the fit leads toγ2 = 1.998(4) over the range [1.99 · 10 5 , 3.16 · 10 8 ] , with a p-value of 0.55. Monte Carlo simulations are shown as symbols, while the small (large) regime power-law fit is plotted with full black lines, and the fitted range marked with red (blue) shading.

II. MODEL
In order to quantify numerically and analytically the effect of thresholding, we study the birth-death [10] process (BDP) with Poissonian reproduction and extinction rates that are proportional to the population size. More concretely, we consider the population size n(g) at (generational) time g ≥ 0. Each individual in the population reproduces and dies with the same rate of 1/2 (in total unity, so that there are n(g) birth or death events or "updates" per time unit on average); in the former case (birth) the population size increases by 1, in the latter (death) it decreases by 1. The state n(g) = 0 is absorbing [11]. Because the instantaneous rate with which the population n(g) evolves is n(g) itself, the exponential distributions from which the random waiting times between events are drawn are themselves parameterised by a random variable, n(g).
Because birth and death rates balance each other, the process is said to be at its critical point [12], which has the peculiar feature that the expectation of the population is constant in time, n(g) = n(g 0 ), where · denotes the expectation and n(g 0 ) is the initial condition, set to unity in the following. This constant expectation is maintained by increasingly fewer surviving realisations, as each realisation of the process terminates almost surely. We therefore define the survival time as the time g s − g 0 such that n(g) > 0 for all g 0 ≤ g < g s and n(g) = 0 for all g ≥ g s . For simplicity, we may shift times to g 0 = 0, so that g s itself is the survival time. It is a continuous random variable, whose probability density function (PDF) is well known to have a power law tail in large times, [12, as in the branching process].
In the following, we will introduce a threshold, which mimics the suppression of some measurements either intentionally or because of device limitations. For the BDP this means that the population size (or, say, "activity") below a certain, prescribed level, h, is treated as 0 when determining survival times. In the spirit of [13, also solar flares, 3], the threshold allows us to distinguish events, which, loosely speaking, start and end whenever n(g) passes through h.
Explicitly, events start at g 0 when lim →0 + n(g 0 − ) = h and n(g 0 ) = h+1. They end at g s when n(g s ) = h, with the condition n(g) > h for all g 0 ≤ g < g s . This is illustrated in Figs. 1 and 4. No thresholding takes place (i.e. the usual BD process is recovered) for h = 0, in which case the initial condition is n(g 0 ) = 1 and termination takes place at g s when n(g s ) = 0. For h > 0 one may think of n(g) as an "ongoing" time series which never ceases and which may occasionally "cross" h from below (starting the clock), returning to h some time later (stopping the clock). In a numerical simulation one would start n(g) from n(g 0 ) = h + 1 at g 0 = 0 and wait for n(g) to arrive at n(g) = h from above. The algorithm may be summarised as where ξ(n) is an exponential random variable with rate n, and b stands for a random variable that takes the values {−1, 1} with probability 1/2. In our implementation of the algorithm, all random variables are handled with the GNU Scientific Library [14].

A. Numerics and data analysis
Monte-Carlo runs of the model reveal something unexpected: The exponent of the PDF of the thresholded BDP appears to change from P (gs) (g s ) ∝ g −2 s at h = 0 to P (gs) (g s ) ∝ g −3/2 s at h = 100 or, in fact, any reasonably large h > ∼ 10. Fig. 2 shows P (gs) (g s ) for the case of h = 100 and two different sample sizes, N 1 = 10 3 and N 2 = 10 10 , corresponding to "scarce data" and "abundant data", respectively. In the former case, the exponent of the PDF is estimated to beγ 1 = 1.52(3) ≈ 3/2; in the latter, the PDF splits into two scaling regimes, with exponentsγ 1 = 1.50070(2) ≈ 3/2 andγ 2 = 1.998(4) ≈ 2. This phenomenon can be investigated systematically for different sample sizes N and thresholds h.
We use the fitting procedure introduced in Deluca and Corral [15], which is designed not only to estimate the exponent, but to determine the range in which a power law holds in an objective way. It is based on maximum likelihood methods, the Kolmogorov-Smirnov test and Monte Carlo simulations of the distributions. In Fig. 3 we show the evolution of the estimated large scale exponent,γ 2 , for different N and for different h. The fits are made by assuming that there is a true power law in a finite range [a,b]. For values of the exponent between 1.5 and 2 larger error bars are observed. For these cases, less data is fitted but the fitting range is always at least two orders of magnitude wide.
It is clear from Fig. 3 that N has to be very large in order to see the true limiting exponent. Even the smallest h investigated, h = 20, needs a sample size of at least N = 10 7 , while for h = 5 000 the correct exponent is not found with less than about N = 10 10 .
The mere introduction of a threshold therefore changes the PDF of events sizes significantly. It introduces a new, large scaling regime, with an exponent that is misleadingly different from that characterising large scale asymptotics. In fact, for small sample sizes (N 1 = 10 3 , see Fig. 2(a)), the only visible regime is that induced by thresholding (in our example, γ 1 = 3/2), while the second exponent (γ 2 = 2), which, as will be demonstrated below, governs the large scale asymptotics, remains hidden unless much larger sample sizes are used ( Fig. 2(b)).
Although the algorithm is easy to implement, finding the two scaling regimes numerically can be challenging. There are a number of caveats: (1) The crossover point g X between the two scaling regimes scales linearly with the threshold, g X = 8πh (see Sec. III B 1), effectively shifting the whole g −2 s asymptotic regime to larger and thus less likely values of g s . To maintain the same number of events (2) Because the expected running time of the algorithm diverges, one has to set an upper cutoff on the maximum generational timescale, say g s < G. If the computational complexity for each update is constant, an individual realisation, starting from n(0) = h + 1 and running up to n(g s ) = h with g s < G, has complexity O(g 2 s ) in large g s where g 2 s is the scaling of the expected survival time of the mapped random walker introduced below. The expected complexity of realisations that terminate before G (with rate ∼ 1/g 2 s ) is therefore linear in G, With the random walker mapping it is easy to see that the expected population size n(g) of realisations that terminate after G (and therefore have to be discarded as g s exceeds G) is of the order n(g s ) ∼ G for g s = G. These realisations, which appear with frequency ∝ 1/G, have complexity O(G 2 ), i.e. the complexity of realisations of the birth-death process is O(G) both for those counted into the final tally and those dismissed because they exceed G. There is no point probing beyond G if N is too small to produce a reasonable large sample on a logarithmic scale, N 2G G dg s g −2 s = const, so that N ∼ G and thus the overall complexity of a sample of size N is O(N 2 ) and thus O(h 2 ) for G ∼ g X ∼ h and N ∝ h from above.
That is, larger h necessitates larger N , leading to quadratically longer CPU time. In addition, parallelisation of the algorithm helps only up to a point, as the (few) biggest events require as much CPU time as all the smaller events taken together. The combination of all these factors has the unfortunate consequence that, for large enough values of h, observing the P (gs) (g s ) ∝ g −2 s regime is simply out of reach (even for moderate values of h, such as h = 100, to show the crossover as clearly as in Fig. 2, a sample size as large as N = 9·10 9 was necessary, which required about 1810 hours of CPU time).

III. RESULTS
While it is straightforward to set up a recurrence relation for the generating function if the threshold is h = 0, the same is not true for h > 0. This is because the former setup (h = 0) does not require an explicit implementation of the absorbing wall since the process terminates naturally when n(g) = 0 (there is no individual left that can reproduce or die). If, however, h > 0, the absorbing wall has to be treated explicitly and that is difficult when the evolution of the process (the effective diffusion constant) is a function of its state, i.e. the noise is multiplicative. In particular, a mirror charge trick cannot be applied.
However, the process can be mapped to a simple random walk by "a change of clocks", a method detailed in [16]. For the present model, we observe that n(g) performs a fair random walk r t by a suitable mapping of the generational timescale g to that of the random walker, r t (g) = n(g) with t(g) ∈ N. In fact, because of the Poissonian nature of the BD process, birth and death almost surely never occur simultaneously and a suitable, unique t(g) is found by t(0) = 0 and lim i.e. t(g) increases whenever n(g) changes and is therefore an increasing function of g. With this map, r t is a simple random walk along an absorbing wall at h, see Fig. 5.
The challenge is to derive the statistics of the survival times g s on the time scale of the BD process from the survival times t s on the time scale of the random walk.
In the following, we first approximate some important properties of the survival times in a handwaving manner before presenting a mathematically sound derivation in Sec. III B.

A. Approximation
The expected waiting time 1 between two events in the BDP is 1/n, if n is the current population size, with n = n x + h such that n x is the excess of n above h. As discussed in detail in Sec. III B, n x is a time-dependent random variable, and so taking the ensemble average of the waiting time is a difficult task. But on the more convenient time scale t, the excess n x performs a random walk and it is in that ensemble, with that time scale, where we attempt to find the expectation which is the expected survival time of a thresholded BD process given a certain return (or survival) time t s of the random walker. In this expression n x (t) is a timedependent random variable and the ensemble average · R(ts) is taken over all random walker trajectories R(t s ) with return time t s . To ease notation, we will include the argument of R(t s ) only where necessary. Approximating the random variable g s by its mean g s (t s ; h) given in Eq.
(2) and approximated further below affords an approximate map of the known PDF P (ts) (t s ) of t s to the PDF P (gs) (g s ) of g s , This map will be made rigorous in Sec. III B, avoiding the use of g s (t s ; h) in lieu of the random variable.
In a more brutal approach, one may approximate the time dependent excess n x (t) in Eq. (2) by its expectation = 1 h + n x (t) R + (higher order terms) (4) so that the expected survival time g s (t s ) given a certain return time t s is approximately t s /(h + n x (t) R ). The quantity n x (t) R is the expected excursion of a random walker, which is well-known to be in the continuum limit (with diffusion constant 1/2) [e.g. 17,18]. Thus, At small times, h πt s /8, the relation between g s and t s is essentially linear, g s ≈ t s /h, whereas for large times, h πt s /8, the asymptote is g s ≈ 8t s /π. Writing the right-hand-side of Eq. (6) in the form 8t s /π allows us to extract the scaling of the crossover time. The argument of the square root is of order unity when t X = 8h 2 /π, for which g s (t X , h) ≈ 4h/π. Moreover, one can read off the scaling form with G(x) = 8/π/(1 + 8/(πx)) and asymptotes G(x) ≈ √ x for small x and lim x→∞ G(x) = 8/π.
The PDF of the survival time of a random walker along an absorbing wall is well-known to be a power law ∝ t −3/2 s for times t s large compared to the time scale set by the initial condition, i.e. the distance a of the random walker from the absorbing wall at time t = 0. The precise value of a is effectively determined by the details the continuum approximation, here a = 1, D = 1/2, and so we require 1 2t s . To derive the PDF of the BD process, note that Eq. (6) has the unique inverse t s (g s ) = πg 2 s 16 T ( 16h πgs ), where T (y) = 1+y + √ 1 + 2y. Evaluating the crossover time by setting y = 1 yields g X = 16h/π. The PDF of the survival time of the BD process finally reads where y = 16h πgs . For small y, the last bracket converges to 2, so P (gs) (g s ; h) ∼ 2 8/πg −2 s for large g s . For large y, the last bracket converges to 1, so P (gs) (g s ; for small g s . This procedure recovers the results in Sec. III B: For g s 16h/π the PDF of the survival times in the BD process goes like g −3/2 s , and for g s 16h/π like g −2 s , independent of h. Eq. (9) also gives a prescription for a collapse, since P (gs) (g s ; h) g 2 s plotted versus g s /h should, for sufficiently large g s , reproduce the same curve, as confirmed in Fig. 7 and Fig. 8.
Applying a threshold introduces a new scale, 16h/π, below which the PDF displays a clearly discernible power law, g −3/2 s , corresponding to the return time of a random walker. The "true" g −2 s power law behaviour (the large g s asymptote) is visible only well above the thresholdinduced crossover.

B. Detailed Analysis
In the previous section we made a number of assumptions, in particular the approximation of replacing the random variable by its expectation, and the approximation in Eq. (4), which both require further justification.
In the present section we proceed more systematically. In particular, we will be concerned with the statistics of the BD survival time g s (R) given a particular trajectory R = {r 0 , r 1 , . . . , r ts } of the random walk, where t s = 2T − 1, necessarily odd, T ∈ N, see Figs. 5 and 9. We will then relax the constraint of the trajectory and study the whole ensemble Ω of random walks terminating at a particular time 2T − 1, denoting as g s (Ω(T )) a survival time drawn from the distribution of all survival times of a BD process with a mapping to a random walker that terminates at 2T − 1 or, for simplicity, just g s (Ω). This will allow us to determine the existence of a limiting distribution for g s (Ω)/ √ T and to make a quantitative statement about its mean and variance. We will not make any assumptions about the details of that limiting distribution; in order to determine the asymptotes of P (gs) (g s ; h) we need only know that the limit exists.
For a given trajectory R of the random walk, the resulting generational survival time g s (R) may be written as where ξ t (α) is a random variable drawn at time t from an exponential distribution with rate α, i.e. drawn from α exp (−αξ), and r t is the position of the random walk at time t, with initial condition r 0 = 1 and terminating at 2T − 1 with r 2T −1 = 0 (see Fig. 9). The mean and standard deviation of ξ t are 1/(r t + h), necessarily finite, so that by the central limit theorem the limiting distribution of g s (R)/ √ T given a trajectory R is Gaussian (for T 1). This ensures that g s (Ω)/ √ T has a limiting distribution (see Appendix B).
It is straightforward to calculate the mean and standard deviation of g s (R) for a particular trajectory R that terminates after 2T − 1 steps. Slightly more challenging is the mean µ(Ω) and variance σ 2 (Ω) of g s (Ω) for the entire ensemble Ω of such trajectories. The details of this calculation are relegated to Appendix A. Here, we state only the key results. For the mean of the survival time, we find (see Eq. (A22)) with ψ(x) = exp −x 2 (Ei(x)−πE(ıx)/ı) and asymptotes see Eq. (A24). The variance is (see Eq. (A27)) with integrals I(x) and K(x) defined in Eq. (A28) and with asymptotes see Eq. (A32). All these results are derived in the limit T 1 in which the mapped random walker takes more than just a few steps, corresponding to a continuum approximation. However, as shown in the following, the results remain valid even for T close to one.
To assess the quality of the continuum approximation and the validity of the asymptotes, we extracted the mean µ(Ω(T )) and variance σ 2 (Ω(T )) of the survival time g s (Ω(T )) from simulated BDPs starting with a population size n(0) = h + 1 and returning to n(g s ) = h after 2T − 1 updates (births or deaths), i.e. the process was conditioned to a particular value of T . In particular, we set the threshold at h = 100, and simulated a sample of 10 5 constrained BDPs for values T = 2 k , k = 0 . . . 20. The results are shown in Fig. 6 and confirm the validity of the large T 1 approximation in Eq. (11) and Eq. (13), as well as the asymptotes Eq. (12) and Eq. (14). Remarkably, as previously stated, Eq. (11) and Eq. (13) are seen to be valid even when the condition T 1 does not reasonably hold.

Distribution of gs
For large T , the generational survival time g s given a survival time 2T − 1 of the mapped random walk has PDF P (gs) (g s ; h; T ) 1 where Φ(x) denotes the limiting distribution of the rescaled survival time (g s − µ(Ω(T )))/ σ 2 (Ω(T )), and the mean µ(Ω(T )) and variance σ 2 (Ω(T )) are given by Eq. (11) and Eq. (13). We demonstrate that Φ exists and find its precise (non-Gaussian) form in Appendix B for completeness, but we will not use this result in what follows: to extract the asymptotic exponents and first order amplitudes, see below, knowledge of the mean µ(Ω) and variance σ 2 (Ω) is sufficient.
As the ensembles Ω(T ) are disjoint for different T , the overall distribution P (gs) (g s ; h) of survival generational times is therefore given by the sum of the constrained distribution P (gs) (g s ; h; T ) weighted by the probability of the mapped random walk to terminate after 2T − 1 steps. In the limit of large T , as assumed throughout, that weight is T −3/2 /(2 √ π) [19]. Therefore, σ 2 (Ω(T )) .
(16) To extract asymptotic behaviour for T h 2 and T h 2 we make a crude saddle point, or "pinching" approximation, by assuming that Φ(x) essentially vanishes for |x| > 1/2 and is unity otherwise. This fixes the random walker time T via g s − µ(Ω(T )) = 0, while the number of terms in the summation is restricted to satisfy |g s − µ(Ω(T ))| ≤ σ 2 (Ω(T )). After some algebra we find The qualitative scaling of these two asymptotes was anticipated after Eq. (9). The crossover time g X = 8πh, shown in Figs. 7 and 8, can be determined by assuming continuity of P (gs) (g s ; h) and thus imposing Fig. 7 shows P (gs) (g s ; h) g 2 s versus g s /h for varying h, comparing Monte Carlo simulations for varying h with the numerical evaluation of Eq. (16) for h = 100, thus confirming the validity of the data collapse proposed in Eq. (9). In particular, the shape of the transition between the two asymptotic regimes, predicted to take place near g X /h = 8π, is recovered from Eq. (16) with great accuracy. As an alternative to the numerical evaluation of Eq. (16), we introduce in Appendix C a complementary approach that provides the Laplace transform of P (gs) (g s ; h), see Eq. (C4). Unfortunately, inverting the Laplace transform analytically does not seem feasible, but numerical inversion provides a perhaps simpler means of evaluating P (gs) (g s ; h) in practice.
In addition to the two asymptotic regimes discussed so far, one notices that Fig. 8 displays yet another "regime" (left-most, green shading), which corresponds to extremely short survival times. This regime is almost exclusively due to the walker dying on the first move via the transition n(0) = h + 1 to n(g s ) = h. In this case, the sum in Eq. (10) only has one term, and hence the PDF of g s can be approximated as P (gs) (g s ; h) = 1 2 (h + 1) exp (−(h + 1)g s ) ∼ h+1 2 , where the factor 1/2 corresponds to the probability of T = 1, and the limit of small g s has been taken. Thus, for very short times g s 1/h, the PDF of g s is essentially "flat". In order to estimate the transition point to this third regime, we impose again continuity of the solution, so that  Given the three regimes shown in Fig. 7, P (gs) (g s ; h) can be collapsed either by ignoring the very short scale, (see Eq. (9)) with G > (x) = 1 for large x and G > (x) = x/(8π) in small x, or according to (19) with G < (x) = 1 for large x and G < (x) = x 3/2 π/2 for small x. Power-law scaling (crossover) functions offer a number of challenges, as they affect the "apparent" scaling exponent [20]. Also, there is no hard cutoff in the present case, i.e. moments g m s = dg s P (gs) (g s ; h) g m s do not exist for m ≥ 2.

IV. DISCUSSION
The main goal of the present paper has been to understand how thresholding influences data analysis. In particular, how thresholding can change the scaling of observables and how one might detect this.
To this end, we worked through the consequences of thresholding in the birth-death process, which is known to have a power-law PDF of survival times with exponent γ = 2. We have shown, both analytically and via simulations, that the survival times g s for the thresholded process include a new scaling regime with exponent γ = 3/2 in the range 1/h g s 8πh (see Fig. 8), where h is the intensity level of the threshold.
We would like to emphasise how difficult it is to observe the asymptotic γ = 2 exponent, even for such an idealized toy model. For large values of the threshold, h = 5 000, sample sizes as large as 10 10 are needed in order to populate the histogram for large survival times. Real-world measurements are unlikely to meet the demand for such vast amounts of data. An illustration of what might then occur for realistic amounts of data that are subject to threshold is given by Fig. 2, where only the threshold-induced scaling regime associated with exponent −3/2 is visible.
Intriguingly, a qualitatively similar scaling phenomenology is observed in renormalised renewal processes with diverging mean interval sizes [21]. The random deletion of points (that, together with a rescaling of time, constitutes the renormalisation procedure) is analogous to the raising of a threshold. It can be shown that the non-trivial fixed point distribution of intervals is bipower law. The asymptotic scaling regime has the same exponent as that of the original interval sizes. But, in addition, a prior scaling regime emerges with a different exponent, and the crossover separating the two regimes moves out with increasing threshold.
A fundamental difference between theoretical models and the analysis of real-world processes is that in the former, asymptotic exponents are defined in the limit of large events, with everything else dismissed as irrelevant, whereas real world phenomena are usually concerned with finite event sizes. In our example, the effect of the threshold dominates over the "true" process dynamics in the range 1/h g s 8πh, and grows with increasing h before eventually taking over the whole region of physical interest.
Of course, real data may not come from an underlying BDP. But we believe that the specific lessons of the BDP apply more generally to processes with multiplicative noise, i.e. a noise whose amplitude changes with the dynamical variable (the degree of freedom): At large thresholds small changes of that variable are negligible and an effectively additive process is obtained (the plain random walker above). Only for large values of the dynamical variable is the original process recovered. These large values are rare, in particular when another cutoff (such as, effectively, the sample size) limits the effective observation time (2T − 1 above). In this context it is worth mentioning the work of Laurson et al. [22], in which thresholds were applied to Brownian excursions. However, since noise is additive in Brownian motion, the effect of thresholding is relatively benign. Indeed, no new scaling regime appears as a result of thresholding, and the asymptotic exponent of −3/2 is recovered no matter what threshold is applied.
In the worst case, thresholding may therefore bury the asymptotics which would only be recovered for much longer observation times. However, if the threshold can easily be changed, its effect can be studied systematically by attempting a data collapse onto the scaling ansatz P (gs) (g s ; h) = g −γ s G(g s /h D ), Eqs. (9) and (18), with ex-ponents γ and D to be determined, as performed in Fig. 7 with γ = 2 and D = 1. The threshold plays an analogous rôle to the system size in finite-size scaling (albeit for intermediate scales). In the present case, the exponents in the collapse, together with the asymptote of the scaling function, identify two processes at work, namely the BDP as well as the random walk.

Appendix A: Mean and variance of the survival time
This appendix contains the details of the calculations leading to the approximation (in large T ), Eq. (11) and Eq. (13), as well as their asymptotes Eq. (12) and Eq. (14), for the mean µ(Ω) and the variance σ 2 (Ω) respectively, averaged over the ensemble Ω(T ), or Ω for short, of the mapped random walks with the constraint that they terminate at 2T − 1, see Fig. 9.
In the following, we will use the notation ξ t for ξ t (r t + h), but it is important to note that any two ξ t (r t + h) are independent, even though the consecutive r t are not. The random variable g s (R) in Eq. (10) is thus a sum of independent random variables ξ t , whose mean and variance at consecutive t, however, are correlated due to r t being a trajectory of a random walk. Because h + r t > 0 for t < 2T − 1, the limiting distribution of (g s (R) − µ(R))/ σ 2 (R) as T → ∞ is a Gaussian with unit variance. Mean µ(R) and variance σ 2 (R) are defined as and are functions of the trajectory R with · R taking the expectation across the ensemble of ξ for given, fixed R, i.e. ξ t R = 1/(r t + h) and ξ 2 t R − ξ t 2 R = 1/(r t + h) 2 . Because ξ t ξ t R = ξ t R ξ t R for t = t the mean and the variance are in fact just If ρ n (R) counts the number of times r t attains a certain level, where we used the fact that within time 2T − 2 our random walker cannot stray further away from r 0 than T − 1 + r 0 , as illustrated in Fig. 9.
In the same vein, we can now proceed to find mean and variance of g s over the entire ensemble Ω = Ω(T ) of trajectories R that terminate at 2T − 1. In the following · Ω denotes the ensemble average over all trajectories R ∈ Ω, each appearing with the same probability where f (ξ t ) is an arbitrary function of the random variable ξ t . We therefore have where ρ n (R) Ω is in fact the expected number of times a random walker terminating at 2T − 1 attains level n.
The variance turns out to require a bit more work. The second moment simplifies significantly when t = t in which case the lack of correlations means that the expectation factorises ξ t ξ t R = ξ t R ξ t R , so that we can write , but that is not a useful simplification for the time being.
The square of the first moment, Eq. (A6), is best written as so that The first and the last pair of sums can be written as using R (1/|Ω|) = 1, so that In the first sum, the two terms can be separated into those in t and one in t. Using the same notation as above, Eq. (A3) we have n+h . The second sum recovers the earlier result in Eq. (A4b), as ξ 2 t R = 2 (rt+h) 2 and ξ t R = 1 rt+h , so that and therefore We now have the mean µ(Ω), Eq. (A6), and the variance σ 2 (Ω), Eq. (A15), in terms of ρ n (R) Ω and ρ n (R)ρ n (R) Ω . In the following, we will determine these two quantities and then return to the original task of finding a closed-form expression for µ(Ω) and σ 2 (Ω).
a. ρn(R) Ω and ρn(R)ρ n (R) Ω Of the two expectations, ρ n (R) Ω is obviously the easier one to determine. In fact, n ρ n (R) = 2T − 1 implies To determine ρ n (R) Ω , we use the method of images (or mirror charges). The number of positive paths (r i > 0) from (t = 0, r 0 ) to (t, n) are t n−r 0 +t 2 − t n+r 0 +t 2 for n + r 0 + t even and n > 0. By construction, the number of paths passing through n = 0 is exactly 0, thereby implementing the boundary condition. The set of paths (to be considered in the following) which terminate at time 2T − 1 by reaching r 2T −1 = 0 is, up to the final step, identical to the set of paths passing through (2T − 2, 1), i.e. r 2T −1 = 0. The number of positive paths (see Fig. 9) originating from (0, r 0 = 1) and terminating at (t = 2T − 1, r 2T −1 = 0) therefore equals the number of positive paths from (0, r 0 = 1) to (t = 2T − 2, n = 1), so that |Ω| = 2T −2 T −1 , which are the Catalan numbers [23,24]. For r 0 = 1 we also have again for n + r 0 + t even. This is the number of positive paths from (0, 1) to (t, n) and by symmetry also the number of paths from (2T − 2 − t, n) to (2T − 2, 1), given that the walk is unbiased (see Fig. 9). If ρ n (t; R) Ω is the expected fraction of paths passing through (t, n) (illustrated in Fig. 9), we therefore have which is normalised by construction, i.e. n ρ n (t; R) Ω = 1. The first binomial factor in the denominator is due to the normalisation, whereas of the last two, the first is due to paths from (0, 1) to (t, n) and the second due to paths from (t, n) to (2T − 2 − t, 1). In the following we are interested in the fraction of times a random walker reaches a certain level during its lifetime, ρ n (R) Ω = t ρ n (t; R) Ω . Using a b 2 a (aπ where we have used T 1 andt = t + 1. Simplifying further gives with the sum running over thet with the correct parity and τ =t/T and ν = n/ √ T . In the limit of large T 1 we find [25] lim where the parity has been accounted for by dividing by 2. In the last step, the integral was performed by some substitutions, as τ (2 − τ ) is symmetric about 1. It follows that in the limit of large T 1 Using that expression in Eq. (A6) gives Eq. (11), namely with [26, Eq. 27.6.3] where we have used r 0 = 1. The first integral is known as the exponential integral function − ∞ −x dy exp(−y) y = − Ei(x) and the second as (a multiple of) the imaginary error function 2 √ π x 0 ds exp s 2 = πE(ıx)/ı. In the limit of large arguments x, the function ψ(x) is − √ π/x + 1/x 2 − √ π/(2x 3 ) + 1/x 4 + O(x −5 ), in the limit of small arguments by γ + 2 ln(x), where γ is the Euler-Mascheroni constant. We conclude that (see Eq. (12)) provided T is large compared to 1, which is the key assumption of the approximations used above. It is worth stressing this distinction: T has to be large compared to 1 in order to make the various continuum approximations (effectively continuous in time, so sums turn into integrals and continuous in state, so binomials can be approximated by Gaussians), but no restrictions were made regarding the ratio T /h 2 .
We are now in the position to determine the relevant asymptotes of σ 2 (Ω), as stated in Eq. (14), Appendix B: Limiting distribution of gs(Ω)/ √ T In this second appendix, we explicitly find the limiting distribution of g s (Ω)/ √ T . We begin by noting that, for T 1, g s (Ω) can be approximated as g s (Ω) 2T 0 dt 1 x(t)+h , where x(t) performs a Brownian excursion of length 2T . While for large but finite T this is clearly an approximation (e.g. the exponential random variables have been replaced by their mean), in the limit of T → ∞ the approximation becomes exact. In particular, the "noise" due to the variance of the exponential random variables scales like log T , see Eq. (A28), and thus vanishes after rescaling with respect to √ T . In addition, owing to the scaling properties of Brownian motion, lim T →∞ where x(t) is a Brownian excursion of length 2. To find the distribution of this quantity, we first define y(t) = t 0 dt 1/x(t ), and the propagator Z(x, y, x 0 , y 0 , t), i.e. the probability for a Brownian particle to go from (x 0 , y 0 ) to (x, y) in time t, without touching the line x = 0. Using standard techniques [27], the associated Fokker-Plank equation for the propagator takes the form with initial condition Z(x, y, x 0 , y 0 , 0) = δ(x − x 0 )δ(y − y 0 ), and boundary condition Z(0, y, x 0 , y 0 , t) = 0.
Taking the Laplace transform with respect to t yields s + 1 x ∂ y − 1 2 ∂ xx Ẑ (x, y, x 0 , s) = δ(x − x 0 )δ(y) (B5) Z(0, y, x 0 , s) = 0 (B6) We first solve the associated homogeneous equation, from which we will be able to construct the solution to the inhomogeneous problem. After substituting the ansatzẐ hom (x, y, s) = Ψ(x, s)ρ(y, s), the equation separates into The first equation is obtained by collecting residues from double poles, and is useful for a small y expansion. The second equation is obtained by expanding Eq. (B13) and inverting term by term, and is useful for a large y expansion. Both expressions converge rapidly and, evaluating at t = 2, are in excellent agreement with simulations, see Fig. 10.