Replica analysis of overfitting in regression models for time to event data: the impact of censoring

We use statistical mechanics techniques, viz. the replica method, to model the effect of censoring on overfitting in Cox’s proportional hazards model, the dominant regression method for time-to-event data. In the overfitting regime, Maximum Likelihood (ML) parameter estimators are known to be biased already for small values of the ratio of the number of covariates over the number of samples. The inclusion of censoring was avoided in previous overfitting analyses for mathematical convenience, but is vital to make any theory applicable to real-world medical data, where censoring is ubiquitous. Upon constructing efficient algorithms for solving the new (and more complex) Replica Symmetric (RS) equations and comparing the solutions with numerical simulation data, we find excellent agreement, even for large censoring rates. We then address the practical problem of using the theory to correct the biased ML estimators without knowledge of the data-generating distribution. This is achieved via a novel numerical algorithm that self-consistently approximates all relevant parameters of the data-generating distribution while simultaneously solving the RS equations. We investigate numerically the statistics of the corrected estimators, and show that the proposed new algorithm indeed succeeds in removing the bias of the ML estimators, for both the association parameters and the cumulative hazard.


Introduction
Inference relies on the foundations provided by classical statistical theory, which was developed for use in settings where the number p of covariates is small compared to the number n of observations [1,2]. The standard Maximum Likelihood (ML) method for estimating model parameters indeed fails in the high-dimensional regime where p = O(n) [3,4,5,7,8,9], due to overfitting. Overfitting is the phenomenon whereby data noise is misinterpreted as signal, leading to biased parameter estimators with large sample-to-sample fluctuations. An overfitted model will predict outcomes for training examples well, but will fail in predicting outcomes for new data. Early strategies to mitigate overfitting include leaving out covariates (at the risk of overlooking relevant predictive information), penalization (equivalent to adding a parameter prior, in Bayesian language), and shrinking parameter estimates after model fitting. The penalization weight or shrinking factor is typically estimated via bootstrapping or cross-validation, which forces one to sacrifice some of the training data; this makes the overfitting even worse. Moreover, penalization and shrinking can at best repair the overfitting-induced bias in the length of the parameter vector, but not the bias in its direction (which will appear as soon as the covariates are correlated [17]).
In order to use existing regression models also in the overfitting regime, it is vital to correctly quantify and undo the effects of overfitting, both for inference and for prediction purposes. As a consequence, in recent years we have seen increased research efforts aimed at extending the classical inference theory to the so-called proportional asymptotic regime, characterized by taking the limit n → ∞ while keeping the ratio ζ = p/n fixed. Several mathematical methods from the domains of the statistical physics of disordered systems [5,9,17,12], computer science [8,10] and statistics [3,4,7] have by now been applied successfully in this latter regime to model the statistics of ML and Maximum A Posteriori Probability (MAP) estimators.
The condition p ≪ n for ML/MAP inference to be used safely is especially problematic in post-genome medicine: we can now routinely measure very many variables per patient and want to use these to develop personalized therapies, but overfitting prevents us from doing so. For time-to-event data, the most common type in medicine, the main regression tools are variations on the model of Cox [11]. Overfitting in this model was analysed in [5,6] as a statistical mechanics problem, via the replica method [20]. A later study [10] used the Convex Gaussian Min-max Theorem [8], with the logarithm of Cox's partial likelihood as utility function, to study overfitting in the association parameters. The latter route avoids the use of replicas but, unlike [5,6], cannot model overfitting in the base hazard rate.
All studies of overfitting in Cox models have so far assumed for simplicity that the data were not subject to censoring. Censoring means that samples are lost to observation prior to events occurring. It is ubiquitous in medicine, since medical studies are always of finite duration, and also since patients often fail to return to hospital appointments for unknown reasons. Before the modern overfitting analysis methods for the proportional asymptotic regime, and their corresponding overfitting bias decontamination formulae, can be applied in the real world, it is hence vital that the theory of [5,6] be extended to include censoring. That is the first aim of this paper.
We generalise the theory of [5,6] by allowing for arbitrary types of non-informative censoring, carry out a replica analysis in the regime p, n → ∞ with fixed ζ = p/n, and derive the extended RS equations. Methodologically we also improve upon the approach of [5,6] by: (i) using a more compact form of the replica approach; (ii) avoiding the variational approximation for the base hazard rate; and (iii) constructing novel and more powerful numerical algorithms for solving and inverting the RS equations, applicable even when nothing is known about the data-generating distribution. This results in precise algorithms for decontaminating estimators of the association parameters and the integrated base hazard from overfitting-induced bias.
This paper is organized as follows. In Section 2 we define our notation and describe briefly the Cox model. We explain the results obtained via the replica method (whose derivation is relegated to an Appendix) and the physical interpretation of the RS order parameters in Section 3, and show in Section 4 how this interpretation inspires an efficient numerical method for solving the RS equations. In Section 5 we test the predictions of the theory against numerical simulations, and find perfect agreement. We construct in Section 6 a new algorithm for simultaneously inferring relevant characteristics of the data-generating distribution and solving the RS equations, leading to a realistic and practical tool for correcting the biased ML estimates of the association parameters and the nuisance parameters (i.e. the integrated base hazard) in the overfitting regime, even in the presence of censoring. We conclude in Section 7 with a discussion of our results and future research.

Definitions
In time-to-event data, each observation ideally reports the time T at which the subject experiences the event under investigation, together with the covariate vector z = (z_1, ..., z_p) ∈ R^p, i.e. the list of characteristics of the subject measured at time zero. All covariate vectors are assumed to have been drawn independently from some distribution p(z). Since any non-zero average will drop out of our formulae, we can always take p(z) to have average zero. However, in virtually all real-world time-to-event data sets observations are censored, i.e. for some subjects we have a missing or partial observation of their event times. For instance, we might only know that an event occurred after or before a certain time point, or within a specific interval (called right, left and interval censoring, respectively). In what follows we focus on right censoring, the most common form of censoring in medical applications. In practice this amounts to assuming that when collecting data we actually observe

t = min{T, C}.   (1)

The random variable C models the censoring time, drawn randomly from an a priori unknown distribution p_c(x), and we are given a binary variable ∆, defined in (2), that indicates whether the subject has at time t experienced the event (∆ = 1) or has been censored (∆ = 0). Note that we can always write p_c(x) = λ_c(x) e^{−Λ_c(x)}, where λ_c(x) is the censoring rate and Λ_c(x) = ∫_0^x dx′ λ_c(x′). The Cox semi-parametric model [11] assumes that the time at which a subject experiences the event is distributed according to

p_T(x|z) = λ_0(x) e^{β_0·z − Λ_0(x) exp(β_0·z)},   (3)

where β_0 ∈ R^p, λ_0(x) is the true (non-negative) hazard rate, and Λ_0(x) = ∫_0^x dx′ λ_0(x′) is the true cumulative hazard. The censoring times C and the event times T are assumed to be statistically independent, i.e. the censoring is said to be non-informative; the joint distribution of observed times t ≥ 0 and event-type indicators ∆ ∈ {0, 1}, given covariates z, can then be written as in (4). The parameters one seeks to infer are the true vector β_0 and the true function λ_0(t). The ML problem (5) then follows from (4) upon computing the log-likelihood density for a data-set of i.i.d. observations.
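As an illustration of the data-generating process just described, the following sketch samples right-censored data from a Cox model via inverse-transform sampling. The specific choices (a Weibull-type cumulative hazard Λ_0(t) = t²/2, uniform censoring, Gaussian covariates) mirror the simulations reported later in this paper; the function name and interface are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_cox(n, p, beta0, t_max=4.0):
    """Sample right-censored data from a Cox model with (illustrative)
    Weibull-type cumulative hazard Lambda_0(t) = t^2/2 and
    non-informative uniform censoring on [0, t_max]."""
    z = rng.standard_normal((n, p))            # zero-mean covariates
    u = rng.uniform(size=n)
    # invert the survival function exp(-Lambda_0(T) e^{beta0.z}) = u
    T = np.sqrt(2.0 * (-np.log(u)) * np.exp(-(z @ beta0)))
    C = rng.uniform(0.0, t_max, size=n)        # censoring times
    t = np.minimum(T, C)                       # observed time t = min{T, C}
    delta = (T <= C).astype(int)               # Delta = 1: event, 0: censored
    return t, delta, z

t, delta, z = sample_cox(400, 2, np.array([1.0, 0.0]))
```

Since Λ_0(T) e^{β_0·z} is a unit exponential random variable, solving Λ_0(T) = −log(u) e^{−β_0·z} for T reproduces the law (3) exactly.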
The problem (5) can be simplified by first maximizing over the hazard rate. Setting the functional derivative with respect to λ to zero gives an equation that can be solved for λ(t), leading to the so-called Breslow estimator (7) [11], in which θ(x) denotes the step function (θ(x) = 1 for x > 0 and θ(x) = 0 for x < 0). Substituting (7) into (6) and disregarding terms that are independent of β, we obtain

β̂_n = argmax_β L_n(β|D).   (8)

The right-hand side of (9) is the logarithm of Cox's famous partial likelihood [11].
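The log partial likelihood and the Breslow estimator can be sketched directly in code. The following is a naive O(n²) implementation, assuming no tied event times; the function names are illustrative.

```python
import numpy as np

def cox_partial_loglik(beta, t, delta, z):
    """Logarithm of Cox's partial likelihood (no ties handling; sketch only).
    For each event i, the contribution is beta.z_i minus the log of the
    sum of exp(beta.z_j) over the risk set {j : t_j >= t_i}."""
    eta = z @ beta
    ll = 0.0
    for i in range(len(t)):
        if delta[i] == 1:
            risk = t >= t[i]                   # risk set at time t_i
            ll += eta[i] - np.log(np.exp(eta[risk]).sum())
    return ll

def breslow_cumhaz(beta, t, delta, z, times):
    """Breslow estimator of the cumulative base hazard, evaluated at the
    given times: a non-decreasing 'staircase' function with a jump at
    each observed event time."""
    eta = np.exp(z @ beta)
    out = []
    for s in times:
        events = (t <= s) & (delta == 1)
        out.append(sum(1.0 / eta[t >= ti].sum() for ti in t[events]))
    return np.array(out)
```

At β = 0 each event contributes minus the log of the risk-set size, which gives a quick sanity check on any implementation.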

Typical behaviour of the Maximum (Partial) Likelihood estimator
Computing the estimator β̂_ML in (8) is equivalent to finding the ground state of a fictitious physical system with Hamiltonian H_n(β|D) = −L_n(β|D), in which the association parameters β act as the degrees of freedom, and the data-set D plays the role of quenched disorder. Statistical physics thus allows us to compute the average over the disorder (i.e. the data D) of the log partial likelihood, giving the typical behaviour of the ML estimator in the proportional asymptotic regime, via the limit n, p → ∞ with ζ = p/n fixed, as in (10). In fact, for reasons that will become clear, we insert a constant regularization factor (whose impact will vanish due to the limit γ → ∞), and replace (10) by (11). The standard procedure for carrying out the disorder average in parametric regression models via the replica method (which goes back to [14]) builds on the assumption that the log-likelihood is a sum of n independent terms, each depending on a single observation (t_i, ∆_i, z_i). For the present Hamiltonian H_n(β|D) this is not the case, so we need an intermediate step. We introduce the empirical distribution (12) and observe that we can write the Hamiltonian (i.e. minus the log partial likelihood) as a functional of it. We can now write (11) in a form where the disorder average is effectively computed over the density of states associated with (12). Upon introducing suitable delta-functionals one then achieves factorisation of the disorder average over the samples i in the data-set D.
The full replica derivation is given in Appendix A for completeness, and, upon making the so-called replica-symmetric (RS) ansatz, yields the expressions (14,15). Here we used the standard short-hand Dy = (2π)^{−1/2} e^{−y²/2} dy, and W(x) is the Lambert W-function [27], defined by the equation W(x) exp[W(x)] = x for all x. The only condition on p(z) needed in the derivation of the above RS equations is that for p → ∞ the true and inferred linear predictors β·z acquire Gaussian statistics‡. The extremum (u⋆, v⋆, w⋆) in (15) is found to satisfy the so-called RS equations. These can be written in alternative forms, for instance by using standard Gaussian identities, and via integration by parts over z in (20), giving (23,24,25). For the distribution P(t, ∆, h) and the functions Λ(t) and S(t) we find in the RS ansatz the expressions given in Appendix A. We note that an alternative way to derive equations (20,21,22) (and a corresponding equation for Λ(t)) would have been to make appropriate choices, such as y → (t, ∆) and θ → {λ_0(t), λ_C(t)}, in the RS formulae of [17]. However, that alternative route would not have generated the powerful expression (26) found with the present approach. By retracing its derivation and making simple adaptations, (26) can be generalized, with the standard convention that ζ = p/n is kept fixed in the limits n, p → ∞. This result shows very transparently the impact of overfitting on the inferred risk factors β·z_i, relative to their true values β_0·z_i. At this point it is helpful to write w = Sκ. The interpretation of the values (κ⋆, v⋆) as solved from the RS equations above can be inferred directly from the results in [17], with p/n = ζ fixed as p, n → ∞. The importance of (κ⋆, v⋆) derives from the facts that the asymptotic distribution of β̂_n depends only on these two quantities [12], and that the overfitting-induced inference bias and noise can be expressed in terms of κ⋆ and v⋆, respectively. The function Λ(t) has the following interpretation: with p/n = ζ fixed, and with the Breslow estimator as given in (7), it gives the typical value of the inferred cumulative hazard Λ̂_n(t).

Numerical solution of RS equations
For fully parametric GLM models in the overfitting regime, the RS equations can always be solved numerically in a relatively straightforward iterative manner, by evaluating the relevant integrals via Gaussian quadratures. Here, for the Cox model with censoring, we have the added complication of the functional order parameter Λ(t), which suggests a different approach, inspired by population dynamics algorithms and our interpretation (26,27,28). For any given values of the scalar order parameters (u, v, w), and a given value of S = |A^{1/2} β_0|, we can generate m ≫ 1 samples {(t_ℓ, ∆_ℓ, y_ℓ, z_ℓ)}_{ℓ=1}^m from the distribution (17). We then define for each (u, v, w, Λ) the quantities ξ_ℓ(u, v, w, Λ) = ξ(t_ℓ, ∆_ℓ, y_ℓ, z_ℓ|u, v, w, Λ), computed via (18), and the population estimator Λ̂_m(t). For sufficiently large population size m, the iterative algorithm defined by the mapping Λ(.) → Λ′(.) = Λ̂_m(.|u, v, w, Λ) will by construction have as its fixed point the solution of (26,27,28). Furthermore, it is relatively easy to invert equation (24) numerically and write u, for fixed (v, w, Λ), as a function of ζ, for instance via Newton's method, so that ζ (which is always known) can be chosen as the independent parameter of our RS equations. We can then solve the resulting surrogate set of RS equations, e.g. via damped fixed-point iteration. When simulating the data, Λ_0(.) is known and, empirically, we find that substituting equation (38) with its equivalent version, obtained by Gaussian integration by parts of (25), leads to much faster convergence with smaller population sizes; hence the latter is adopted instead of (38).
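The damped fixed-point iteration invoked above can be sketched generically as follows. The update rule is the standard x ← (1 − α)x + αF(x); the specific RS map F is not reproduced in this excerpt, so we demonstrate the scheme on a toy scalar equation.

```python
import numpy as np

def damped_fixed_point(F, x0, alpha=0.5, tol=1e-10, max_iter=10_000):
    """Damped fixed-point iteration x <- (1 - alpha)*x + alpha*F(x).
    This is the generic scheme used both for the scalar order parameters
    and (with x a population/grid representation) for the functional
    order parameter Lambda(.)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_new = (1.0 - alpha) * x + alpha * np.asarray(F(x), dtype=float)
        if np.max(np.abs(x_new - x)) < tol:
            return x_new
        x = x_new
    return x

# toy demonstration: solve x = cos(x)
root = damped_fixed_point(np.cos, 1.0)
```

The damping factor α < 1 trades speed for stability: it suppresses the oscillations that a plain iteration x ← F(x) can exhibit when the map is not contractive.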

Simulation tests of RS theory
When some observations are censored, it is known that in the proportional asymptotic regime the ML estimator β̂_ML does not always exist [28], and a recent paper [10] established that, asymptotically and under the assumption of Gaussian covariates, β̂_n undergoes a sharp phase transition at some critical value ζ_c. Here we focus on the region ζ < ζ_c where the ML estimator does exist. Since one obviously cannot solve the RS equations for all possible choices of the hazard rate λ_0(.), the censoring rate λ_c(.) and the true association amplitude S = |A^{1/2} β_0|, we have in our simulations chosen a censoring distribution p_c(t) that is uniform between 0 and t_max, reflecting the realistic scenario of a clinical trial of duration t_max in which patients are recruited at a constant rate for the trial duration. Different choices of S, t_max and the true hazard rate λ_0(.) will lead to different expected fractions ⟨∆⟩ of true (non-censored) events. To test the predictions of our RS equations we generated 500 survival data-sets D of size n = |D| = 400, with uniformly distributed censoring on the interval [0, 4], using cumulative hazards of log-logistic form, Λ_0(t) = log(1 + t²), and of Weibull form, Λ_0(t) = t²/2. We used uncorrelated unit-variance Gaussian covariates and true association vector β_0 = ê_1 (i.e. the first unit vector), hence S = 1. In all cases ⟨∆⟩ ≈ 0.6, so around 40% of events are censored, for both choices of Λ_0(.). We then applied Cox's ML regression protocol to infer the associations and the cumulative hazard (via Breslow's formula), giving the estimators β̂_n and Λ̂_n(.). From β̂_n we subsequently computed the overfitting markers (κ̂_n, v̂_n) in (31,32).
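The expected event fraction ⟨∆⟩ for these two simulated scenarios can be estimated by a short Monte Carlo computation, as sketched below (uniform censoring on [0, 4], standard Gaussian linear predictor with S = 1; the helper names are our own).

```python
import numpy as np

rng = np.random.default_rng(1)

def event_fraction(inv_cumhaz, n=200_000, S=1.0, t_max=4.0):
    """Monte Carlo estimate of <Delta>, the expected fraction of
    non-censored events, for a Gaussian linear predictor eta ~ N(0, S^2)
    and uniform censoring on [0, t_max]; inv_cumhaz is Lambda_0^{-1}."""
    eta = S * rng.standard_normal(n)
    x = -np.log(rng.uniform(size=n)) * np.exp(-eta)  # Lambda_0(T)*e^eta ~ Exp(1)
    T = inv_cumhaz(x)
    C = rng.uniform(0.0, t_max, size=n)
    return np.mean(T <= C)

# Weibull form Lambda_0(t) = t^2/2 and log-logistic form Lambda_0(t) = log(1+t^2);
# the expm1 argument is clipped to avoid harmless overflow (such large T are
# censored anyway, since t_max = 4)
frac_weibull = event_fraction(lambda x: np.sqrt(2.0 * x))
frac_loglog = event_fraction(lambda x: np.sqrt(np.expm1(np.minimum(x, 50.0))))
```

Both estimates land in the vicinity of the value ⟨∆⟩ ≈ 0.6 quoted above, consistent with roughly 40% censored events.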
In Figure 1 we plot the inferred cumulative hazards Λ̂_n(.) versus the true cumulative hazards Λ_0(.), which can be done unambiguously since both are always non-decreasing functions of time. By the nature of the Breslow estimator (7), these curves take the form of 'staircase' functions. We also show in the same panels, as red dashed lines, the RS prediction of the relation between the two quantities. We conclude from Figure 1 that in all cases the solution of the RS equations correctly predicts the typical values of Λ̂_n(.). In Figure 2 we compare the replica predictions (κ⋆, v⋆) (solid lines) with the corresponding measurements (κ̂_n, v̂_n) (markers with error bars), for different values of ζ = p/n ∈ [0, 0.5]. Markers give the averages over the 500 data-set realizations; error bars indicate the standard deviations. Again we observe excellent agreement between the RS theory and the simulations, in spite of the relatively small sample size n = 400. The latter feature is important for applications of the theory, since real data-sets in survival analysis indeed often have sizes of that order.

De-biasing protocols for overfitted ML estimators
Given the agreement between theory and simulations, we now explore the potential of our theory as a systematic tool with which to decontaminate ML estimators and build asymptotically unbiased estimators β̂_n and Λ̂_n(.). Two obstacles initially appear to hinder direct application of our RS equations for bias decontamination. First, our equations involve the amplitude S, which is not directly accessible. Second, bias decontamination requires inverting the complex relation between the inferred and true integrated hazards: we have already constructed an algorithm for computing Λ̂_n(.) in terms of Λ_0(.), but now we need to compute Λ_0(.) given Λ̂_n(.). In this section we remove both obstacles, and show that the RS theory can indeed be used for creating accurate unbiased estimators in the overfitting regime.

Derivation of a practical algorithm for de-biasing
Given the assumptions of our theory, we may write the marginal outcome probability p(t, ∆) = ∫dz p(z) p(t, ∆|β_0·z) as in (43). Hence the log-marginal likelihood density for n i.i.d. observations takes the form (44), where we introduced the function

φ_∆(x, s) = ∫Dy e^{∆sy − x exp(sy)}.   (45)

We next compute the ML estimators for (λ, λ_c) by maximization of (44). Taking the functional derivative with respect to λ(.) gives, after standard manipulations, the estimator (46) for the hazard rate. Upon integrating both sides over t, and replacing in the right-hand side the (as yet unknown) value of Λ(t_i) by the estimator Λ̂_n(t|S) = ∫_0^t dt′ λ̂_n(t′|S), one then finds the simple fixed-point equation (47) for Λ̂_n(.|S), which does not involve any parameter that could be susceptible to overfitting.
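The function φ_∆(x, s) of (45) is a one-dimensional Gaussian integral, and can be evaluated accurately by Gauss-Hermite quadrature, e.g. as in the following sketch (the function name is ours):

```python
import numpy as np

def phi(delta, x, s, n_nodes=80):
    """phi_Delta(x, s) = int Dy exp(delta*s*y - x*exp(s*y)), with Dy the
    standard Gaussian measure, evaluated by Gauss-Hermite quadrature in
    the probabilists' convention (weight e^{-y^2/2})."""
    y, w = np.polynomial.hermite_e.hermegauss(n_nodes)
    w = w / np.sqrt(2.0 * np.pi)   # normalize so the weights sum to 1
    return float(np.sum(w * np.exp(delta * s * y - x * np.exp(s * y))))
```

Useful sanity checks: for s = 0 the integrand is constant, so φ_∆(x, 0) = e^{−x}; and φ_0(0, s) = 1 for any s.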
The result of solving this fixed-point equation by (damped) iteration is remarkably accurate, as will be shown below. Repeating the same steps for the censoring rate λ_c(.) gives the unbiased ML estimator (48). One could in principle attempt to determine S in the same way: extremize (44) with respect to S, and solve the resulting equation simultaneously with (47) for Λ(.|S) and S. Unfortunately, the resulting equations admit the trivial solution S = 0, describing the trivial situation where the outcomes (t, ∆) are independent of the covariates. Thus we need an independent extra equation with which to determine the value of S, to be added to and solved simultaneously with the RS order parameter equations and (47). We first note that (31,32) imply the relation (49), valid with p/n = ζ fixed as p, n → ∞. If the covariate correlation matrix A is known, (49) could serve as the desired extra equation. If the covariate correlation matrix is not known, as would generally be the case, we can use the result (50), which exploits the fact that in regression the n quantities β̂_ML·z_i are always observable. This identity is derived in Appendix B. It relies explicitly on the alternative form of the replica calculation followed in the present paper, from which equation (26) followed (a result that could have been, but was not, derived in earlier papers). In contrast to (49), (50) can always be used as an additional equation with which to determine S.
The final result is the following algorithm, which seeks to combine optimally the RS theory with the available data D and the biased ML estimators. After measuring the left-hand side of (50), one solves numerically the four scalar equations (23,24,25,50) simultaneously for (u, v, w, S), e.g. via (damped) fixed-point iteration, with the short-hands (18) for the function ξ(t, ∆, y, z) and (17) for p(t, ∆|Sy). The censoring rate λ_c(t) in (17) is estimated by the time derivative of (48), and, for each value of S, the base hazard λ_0(t) by the time derivative of the result of iterating (47) to a fixed point. The various Gaussian integrals can be done via Gauss-Legendre quadrature. The resulting new unbiased estimator for the association parameters is, according to (31), given by β̂ = β̂_ML/κ⋆, where κ⋆ = w⋆/S.

Figure 3. In all panels we plot with black lines the Breslow estimator Λ̂_n(.) versus the true cumulative hazard Λ_0(.), for 500 independent simulations with distinct data realizations. Yellow curves show the predictions of the RS theory (solved with populations of size m = 10^6), and red curves indicate the diagonal (which would have been found for perfect regression). We also show in blue the de-biased estimator Λ̂(.) = Λ̂_n(.)/κ⋆ for the cumulative hazard, versus the true cumulative hazard Λ_0(.), for the same 500 simulations.
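The final rescaling step of the algorithm, together with one plausible form of the overfitting markers referred to in (31,32), can be sketched as follows. The displayed equations (31,32) are not reproduced in this excerpt, so the decomposition below is our assumption of their standard form (estimator ≈ κ times the true vector plus orthogonal noise); only the rescaling β̂ = β̂_ML/κ⋆ is stated explicitly in the text.

```python
import numpy as np

def overfitting_markers(beta_hat, beta0):
    """Decompose the ML estimator as beta_hat ~ kappa*beta0 + noise,
    giving the bias factor kappa and a per-component noise scale v.
    (Assumed form of the markers (kappa_n, v_n); the displayed
    equations (31,32) are not reproduced in this excerpt.)"""
    kappa = (beta_hat @ beta0) / (beta0 @ beta0)
    resid = beta_hat - kappa * beta0           # orthogonal noise component
    v = np.linalg.norm(resid) / np.sqrt(len(beta0))
    return kappa, v

def debias(beta_ml, Lambda_ml, kappa_star):
    """Final correction step of the algorithm above: divide out the
    multiplicative overfitting bias factor kappa_star."""
    return beta_ml / kappa_star, Lambda_ml / kappa_star
```

In simulations, where β_0 is known, the markers can be measured directly from each fitted β̂_n and compared with the RS predictions (κ⋆, v⋆); on real data, κ⋆ is instead obtained by solving the RS equations as described above.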

Simulation tests of the proposed new estimators
In Figure 3 we plot both the uncorrected and the de-biased estimators for the cumulative hazard, for simulations of survival data with n = 400 and around 40% censored events. These data show convincingly that the ML estimator Λ̂_n(.) (shown in black) is quite biased, with average values as predicted by the RS theory (dashed yellow line), but that the corrected estimator Λ̂(.) (shown in blue), when plotted against the true cumulative hazard, is centred around the red dashed line (the diagonal, corresponding to unbiased estimation). This is remarkable, given that our de-biasing algorithm does not require any knowledge of the data-generating process, apart from assuming the form of a semi-parametric proportional hazards model (4). As expected, the corrected estimator exhibits a larger variance than the ML estimator (reflecting the general and well-known bias-variance trade-off in inference). The increasing variance of both estimators at large times is simply a finite-size effect: already at t = 2.5 the survival probability is below 0.1, so only few subjects are still alive and able to generate events that may contribute to hazard-rate inference. Similarly, we show in Figures 4, 5 and 6 the differences between the ML estimators and the new de-biased estimators for the amplitude S (the 'signal strength' of the true associations) and for the non-zero and zero components of the true association vector.
Our simulations support the conclusion that the new decontamination algorithm proposed above indeed leads to virtually unbiased estimators for Λ_0(.), S, and β_0, even for relatively small data-sets and in the presence of significant levels of censoring. This is a nontrivial result, since we have only required that there is no model mismatch, i.e. that the data were generated from a Cox model, and that the distribution of the risk factors β_0·z_i is asymptotically Gaussian. In particular, the algorithm does not require a priori knowledge of any of the parameters of the data-generating process.

Conclusion
In this paper we have extended previous analytical studies of overfitting in high-dimensional Cox regression for time-to-event data [5,17,6] by including censoring. Censoring is a necessary ingredient of any survival analysis theory if its results are to be applied to real medical data (the main application area of survival analysis). While still using the replica method as our main tool, we have here, in addition to including censoring, also chosen a different route for carrying out the analysis compared to [5,17,6], which enabled us to derive an explicit formula for the inferred distribution of risk factors. This latter formula, in turn, paved the way for the creation of new algorithms for applying the replica-symmetric (RS) theory, which previously required additional knowledge and/or additional approximations. All our new theoretical results and predictions are validated by numerical simulations.
We consider the main deliverables of the present paper to be the following three. First, we now have an accurate theory with which to understand quantitatively the impact of censoring on overfitting phenomena in the Cox model, so that replica-based overfitting analysis can now be applied to realistic real-world problems in medicine (i.e. to data from clinical studies of finite, as opposed to infinite, duration). Second, we are now able to solve directly and accurately the RS equations for the inferred cumulative hazard, instead of relying on the variational approximations proposed in [5,17,6]. Third, we have been able to construct from the present theory a practical and accurate algorithm for decontaminating inferences of overfitting-induced bias, which does not require knowledge of any of the parameters of the true data-generating distribution. The latter algorithm combines in a very effective way the RS equations with those quantities that are measurable in ML regression, and infers the effective signal strength S = |A^{1/2} β_0| as a by-product.
The results presented here thus have not only theoretical but also very practical value. The availability of tools for de-biasing the important estimators of the Cox model implies that one can now achieve significantly improved outcome predictions in realistic clinical scenarios, compared to what would have been achievable with ML regression alone: either by using more covariates for prediction (i.e. increasing ζ = p/n by increasing p), by using fewer patients to do so (i.e. increasing ζ = p/n by decreasing n), or by allowing for increased levels of censoring without detrimental consequences. To the best of our knowledge, the bias decontamination algorithm proposed here, which corrects both association parameters and nuisance parameters (i.e. the cumulative hazard in the Cox model), is the first of its kind.
We envisage future work to involve the application of the new estimator decontamination algorithm to real (partially censored) survival data, finding an expression or equation for the transition point ζ_c of the Cox model in the presence of censoring, and working out the RS theory with ridge penalization included in the Cox formulae for survival analysis with censoring. We also envisage generalizing the new decontamination algorithm to more general GLM regression models, where the efficient computation of the signal strength S, of other possible parameters of the true data-generating model, and of the nuisance parameters has in the past always required unwelcome approximations that we expect will now no longer be necessary.

Figure 1. Comparison of theoretical predictions and simulation data, for Cox's survival analysis model with uniform censoring on the interval [0, 4], giving around 40% censored events, and data-set size n = 400. Top row: p = 100 (so ζ = 0.25); bottom row: p = 200 (so ζ = 0.5). Panels (a,c): cumulative hazard Λ_0(t) = log(1 + t²). Panels (b,d): cumulative hazard Λ_0(t) = t²/2. In all panels we plot with black lines the Breslow estimator Λ̂_n(.) versus the true cumulative hazard Λ_0(.), for 500 independent simulations with distinct data realizations. Yellow curves show the predictions of the RS theory (solved with populations of size m = 10^6), and red curves indicate the diagonal (which would have been found for perfect regression). Further details are given in the main text.

Figure 2. Comparison of theoretical predictions and simulation data, for Cox's survival analysis model with uniform censoring on the interval [0, 4], giving around 40% censored events, and data-set size n = 400. Top row: overfitting-induced bias factor κ̂_n versus ζ; bottom row: overfitting-induced noise amplitude v̂_n versus ζ. Panels (a,c): data for cumulative hazard Λ_0(t) = log(1 + t²). Panels (b,d): data for cumulative hazard Λ_0(t) = t²/2. In all panels we plot with markers and error bars the simulation results, averaged over 500 independent simulations with distinct data realizations, and with solid lines the predictions κ⋆ and v⋆ of the RS theory (solved with populations of size m = 10^6).

Figure 4. Distributions of the values of the effective signal strength S = |A^{1/2} β_0| inferred by our de-biasing algorithm, for the same data as those used in Figure 3. In all panels the maximum of the histogram corresponds to the true value S = 1, in spite of the relatively modest data-set size (around 240 non-censored events).

Figure 5. Distributions of the values of the non-zero component ê_1·β̂ of the association vector, both for the standard ML Cox regression estimator (dark grey) and for the new estimator inferred by our de-biasing algorithm (light grey), for the same data as those used in Figure 3. We also show as a vertical dashed line the location of the true value ê_1·β_0 = 1, and as a solid curve the predicted Gaussian asymptotic distribution of the new estimator, with average 1 and variance v²⋆/(κ²⋆ p). We observe that the new estimator indeed successfully removed the overfitting-induced bias of the ML one, while its distribution exhibits finite-size effects.
Figure 6. Distributions of the values of the zero component ê_2·β̂ of the association vector, both for the standard ML Cox regression estimator (dark grey) and for the new estimator inferred by our de-biasing algorithm (light grey), for the same data as those used in Figure 3. We also show as a vertical dashed line the location of the true value ê_2·β_0 = 0, and as a solid curve the predicted Gaussian asymptotic distribution of the new estimator, here with zero average and variance v²⋆/(κ²⋆ p). Here, as expected, the ML estimator was already unbiased, since the theory predicts the overfitting-induced bias to take the form of a multiplicative factor κ⋆ (which would here simply multiply the number zero).