Hierarchical Bayesian pharmacometrics analysis of Baclofen for alcohol use disorder

Alcohol use disorder (AUD), also called alcohol dependence, is a major public health problem, affecting almost 10% of the world’s population. Baclofen, as a selective GABAB receptor agonist, has emerged as a promising drug for the treatment of AUD. However, the inter-trial, inter-individual and residual variability in drug concentration over time in a population of patients with AUD is unknown. In this study, we use a hierarchical Bayesian workflow to estimate the parameters of a pharmacokinetic (PK) population model from Baclofen administration to patients with AUD. By monitoring various convergence diagnostics, the probabilistic methodology is first validated on synthetic longitudinal datasets and then applied to infer the PK model parameters based on the clinical data that were retrospectively collected from outpatients treated with oral Baclofen. We show that state-of-the-art advances in automatic Bayesian inference using self-tuning Hamiltonian Monte Carlo (HMC) algorithms provide accurate and decisive predictions on Baclofen plasma concentration at both individual and group levels. Importantly, leveraging the information in prior provides faster computation, better convergence diagnostics, and substantially higher out-of-sample prediction accuracy. Moreover, the root mean squared error as a measure of within-sample predictive accuracy can be misleading for model evaluation, whereas the fully Bayesian information criteria correctly select the true data generating parameters. This study points out the capability of non-parametric Bayesian estimation using adaptive HMC sampling methods for easy and reliable estimation in clinical settings to optimize dosing regimens and efficiently treat AUD.


Introduction
Alcohol use disorder (AUD) is a major public health problem, affecting almost 10% of the world's population (Schuckit 2009). The harmful use of alcohol is responsible for 5.1% of the global burden of disease, resulting in more than 200 diseases, injuries and other health conditions, and contributing to 3 million deaths globally each year, i.e. 5.3% of all deaths according to World Health Organization (2022). Beyond health consequences, this brain disorder brings significant mental and behavioral symptoms, as well as social and economic losses to individuals and society at large.
Baclofen is a selective agonist of the Gamma-Aminobutyric Acid type B receptors (GABA B ), which may exert an inhibitory effect on dopaminergic neurons (Doherty and Gratton 2007). It was originally marketed as a muscle relaxant for the treatment of neurological-induced spasticity. Various experiments have shown a positive effect of Baclofen on AUD in rodents and non-human primates (see Colombo et al 2018 for a review). Although there is no clear evidence on the dosing, efficacy, safety, and ideal duration of Baclofen treatment for AUD, clinical studies have shown its promise as a medication for patients with moderate to severe AUD (Agabio et al 2002, Garbutt et al 2021). Preclinical experiments have also shown its efficacy in reducing alcohol withdrawal syndrome (Brennan et al 2013) and voluntary alcohol consumption (Colombo et al 2004). The optimal dosage of Baclofen varies according to individuals and has not been well established with a general agreement. It is mainly prescribed to patients for whom the existing treatments have not been effective, or in those with liver disease due to its minimal damage to the liver. See de Beaurepaire et al (2019), for a review of the use of Baclofen as a treatment in human AUD. Although in recent years, Baclofen has been used to reduce craving, voluntary alcohol intake and withdrawal syndrome of alcoholic patients, a wide inter-individual variability has been observed, and the potential high risk of sedation is unknown (Imbert et al 2015, Simon et al 2018. Thus, the precise efficacy of dosing and the ideal duration of treatment in different AUD patient groups need to be evaluated. Mathematical modeling is widely used in many areas of science to learn about the data generation process, make predictions on outcomes (unseen data), and to justify hypotheses. In the modeling framework, differential equations provide us with the natural evolution of the system under study at any given time point. They are often used to describe the principles that govern the dynamics of the system, allowing us to adequately describe the processes involved and make quantitative predictions, such as the rate of change of drug concentration and clinical efficacy. In this study, we use a population pharmacokinetic (PK) model of the Baclofen effect on patients with AUD to estimate the variability in drug concentration over time in a population of patients. The PK analysis has been widely used in clinical development, with several applications, for instance, to improve the understanding of the in vivo behavior of complex delivery systems, as it allows for the separation of drug-, carrier-, and pharmacological system-specific parameters (Sheiner and Ludden 1992, Meibohm and Derendorf 1997, Bonate 2011, Upton and Mould 2014. Analysis of PK model is currently an indispensable component of drug discovery, by ensuring that robust evidence from preclinical models closely shapes the design of clinical studies (Zou et al 2020). Although PK modeling has facilitated the drug development process, the accurate and reliable estimation of its parameters from noisy data is a major challenge in conducting clinical research to determine the safety and efficacy of a drug within a particular disease or specific patient population. To estimate unknown quantities, optimization methods within the frequentist approach are often used in practice, in which an objective (or cost) function is defined to score the performance of the model by comparing the observed and predicted values. However, such a parametric approach results in only a point estimation, and the optimization algorithms may easily get stuck in a local maximum, requiring multi-start strategies to address the potential multi-modalities. Moreover, the estimation depends critically on the form of the objective function defined for optimization, and the models involving differential equations often have non-identifiable parameters (Hashemi et al 2018).
In this study, we use the Bayesian approach to address these challenges in the estimation of a population PK model's parameters from synthetic data (generated by known values for validation purposes), and then from routine clinical data that were retrospectively collected from 67 adult outpatients treated with oral Baclofen. The Bayesian framework is a principled method for inference and prediction with a broad range of applications, wherein the uncertainty in parameter estimation is naturally quantified through probability distribution placed on the parameters, which are updated with the information provided by data (Bishop 2006. Such a probabilistic technique provides the full posterior distribution of unknown quantities in the underlying data generating process, given only observed responses and the existing information about uncertain quantities expressed as a prior probability distribution (Ferreira et al 2020, Hashemi et al 2020, 2021. In other words, Bayesian inference provides all plausible parameter ranges that are consistent with observation by integrating information from both domain expertise and experimental data. In the context of clinical trials, the frequentist approach utilizes prior information (based on evidence from previous trials) only in the design of a trial, not in the analysis of the data (Gupta 2012, Jack andChu 2012). In contrast, the Bayesian approach provides a formal mathematical framework to combine prior information with available information at the design, conduct, and analysis stages of the experiment (Spiegelhalter et al 1999, Berry 2006, Yarnell et al 2021. To conduct Bayesian data analysis, Markov chain Monte Carlo (MCMC) methods have often been used to sample from and, hence, approximate the exact posterior distributions. However, MCMC sampling in high-dimensional parameter spaces, which converge to the desired target distribution is non-trivial and computationally expensive (Betancourt 2014(Betancourt , 2017. In particular, the use of differential equations, such as PK population models, together with noise in data raises many convergence issues (Hashemi et al 2020, Grinsztajn et al 2021, Jha et al 2022. Designing an efficient MCMC sampler to perform principled Bayesian inference on high-dimensional and correlated parameters remains a challenging task. Although Bayesian inference requires painstaking model-specific derivations and hyper-parameter tuning, probabilistic programming languages such as Stan (Carpenter et al 2017) provide high-level tools to reliably solve complex parameter estimation problems. Stan (see https://mc-stan.org) is a state-of-the-art platform for high-performance statistical computation and automatic Bayesian data analysis (Gelman et al 2020), which provides advanced algorithms (Hoffman and Gelman 2014), efficient gradient computation (Margossian 2018), and numerous diagnostics to check the convergence of sampling (Vehtari et al 2021), hence, the reliability of inference and prediction.
In the present work, to estimate the full posterior distribution of a population PK model's parameters, we use a self-tuning variant of Hamiltonian Monte Carlo (HMC) in Stan. This algorithm adapts the parameters of the HMC, making the sampling strategy more efficient and automated (Hoffman and Gelman 2014). The Bayesian setting in this study, which is first validated on synthetic data according to various diagnostics, enables us to efficiently and accurately estimate the effect of Baclofen on retrospective patients with AUD. Importantly, we demonstrate that leveraging the level of information in priors provide faster computation, better convergence diagnostics, and substantially higher out-of-sample prediction accuracy. Moreover, we show that the root mean squared error as a measure of within-sample predictive accuracy can be misleading for model evaluation and selection. In constrats, the fully Bayesian information criteria (such as Watanabe-Akaike information criterion; Gelman et al 2014) correctly select the true data generating parameters. This work may pave the way to reliably predicting the efficacy of treatment drugs from longitudinal patient data in a fully Bayesian setting, thus optimizing strategies for clinical decision-making, especially in brain disorders.

Clinical data
In this study, we used the routine clinical data retrospectively collected from 67 adult outpatients (43 men and 24 women, with weights ranging from 42 kg to 128 kg and ages ranging from 29 to 68 years old), with AUD treated with oral Baclofen, in the Department of Addictology at Sainte Marguerite Hospital in Marseille, France. Ethics committee approval and patient consent were not compulsory in France for the use of retrospective therapeutic drug monitoring data, before Loi Jardé (n o 2012-300 du 5 Mars 2012 relative aux recherches impliquant la personne humaine). Our data were collected before 2012, thus they are retroactively considered as Hors Loi Jardé. In October 2018, France became the first country to officially approve Baclofen for AUD (Rolland et al 2020). The French Medicines Agency (ANSM) has granted marketing authorization for Baclofen to support drinking reduction in AUD, which is a major and insufficiently addressed public health issue, with more than 60 000 patients treated with Baclofen in France (Rolland et al 2020). The data has been thoroughly described and published in a previous study by Imbert et al (2015).
In brief, Baclofen was administered orally in conventional tablet form and the dosing schedule consisted of a tailored and variable dose for each patient. All patients met the DSM-5 criteria for AUD (American Psychiatric Association, 2013). Therapeutic drug monitoring enabled the measurement of plasma concentrations. The treatment was conducted similarly for all patients, that is, doses were progressively increased (subject to a good tolerance to the treatment) from an initial dose of 15 mg/d (5 mg three times per day [t.i.d]) for 1 week to reach the target dose of 180 mg at 3 months (representing a Baclofen increase of 15 mg every week). The clinical event schedule (dosing and measurements) of each subject is specified according to the Nonlinear Mixed Effects Modeling software (NONMEM, Beal et al 2009) conventions. The number of observations is subject-dependant, with a total of 427 recordings. For each patient, the number of measurements is comprised of between 1 and 16 observations (median = 6). The total follow-up time ranges between 1 and 613 days (average is 185 days). The following data were collected for each patient: drug administration data with the patient's dosage regimen of Baclofen, serum concentration of Baclofen with the date and time of each measurement, demographic data (age, body weight, height, gender), biological data (creatinine, urea, hepatic transaminases aspartate aminotransferase/alanine aminotransferase, albumin, prothrombin ratio, fibrinogen, mean corpuscular volume, bilirubin, alkaline phosphatase, carbohydrate-deficient transferrin, gamma-glutamyl transferase) and alcohol and tobacco consumption. The planned visits were every week the first month then at months 2 and 3 following the treatment initiation. At each visit, routine psychological support counseling was provided by the same professional staff and craving level for alcohol was assessed using the obsessive compulsive drinking scale (OCDS). According to a one-compartment PK population model with first-order elimination (ADVAN2 TRANS2), none of the covariates tested was able to improve the fit using an exponential model, or decrease the intersubject variability and the base model, and previous analysis showed no formal bias (Imbert et al 2015).

Synthetic data
Using synthetic data for fitting allows us to validate the inference process as we know the ground-truth of the parameters being estimated. Therefore, we can use standard error metrics to measure the similarity between the inferred parameters and those used for data generation. Synthetic data were generated following the same event schedule as empirical clinical data and using a one-compartment population model with first-order absorption (see equation (1)). The synthetic data was generated using the R package mrgsolve (Baron 2022), which enables simulation from ODE-based hierarchical models, including random effects and covariates. The mrgsolve package (Elmokadem et al 2019) uses Livermore Solver for Ordinary Differential Equations (LSODE; Radhakrishnan et al 1993) of the Fortran ODEPACK library (Hindmarsh 1992) integrated in R through the Rcpp (Eddelbuettel and Francois 2011) package.

Population model for pharmacometrics
PK models are currently used to predict clinical response in humans through understanding the mechanism of action and disease behavior for a given drug. Inference on the PK model from data can optimize the design of clinical trials, provide an appropriate dose range, anticipate the effect in certain subpopulations, and better predict drug-drug interactions. The PK population models invoke nonlinear mixed-effects models that allow for the quantification of exposure-response relationships on both a global population and individual level (Bonate 2011, Owen and Fiedler-Kelly 2014, Keizer et al 2018. Such a modeling approach provides a flexible tool for hypothesis testing and refining assumptions, e.g. the random phenomena underlying the exposure-response mechanism by including group-specific (in this case, subject-specific) effects within a population. This model also allows for the inclusion of covariates whose influence on inter-group variations can be inferred.
In the present work, drug concentration in the plasma is modeled by a one-compartment model with first-order absorption The above linear ODE system describes the evolution of the amount y of the drug in both dosing compartment 4 and the central compartment (plasma). The first-order kinetics parameter k a (h −1 ) is the absorption constant of the drug. The parameter V (L) is the volume of distribution, which represents the hypothetical volume in which the drug would need to be diluted to the same level as in plasma. The clearance CL (L/h) controls the elimination of the drug from the organism, and such a linear ODE system has an analytical solution, as we used in this study. Let y = (y dosing , y central ) T , the linear system given by equation (1) can be expressed in matrix form (2) Solutions of this homogeneous linear system at time t are given by where y 0 = y(0) is the initial condition and e tA denotes the matrix exponential 5 .

Bayesian modeling
In Bayesian modeling, all model parameters are treated as random variables and the values vary based on the underlying probability distribution (Bishop 2006). That is, in the case of a one-compartment PK model, kinetic parameters CL, V and k a will be interpreted as random variables, and we aim to infer their probability distributions based on prior knowledge (e.g. derived from physiological information or previous evidence), updated with available information in the observed data through the so-called likelihood function, i.e. the probability of some observed outcomes given a set of parameters. Although the likelihood function can provide the best-fit points (maximum likelihood estimators), we are interested in the whole posterior distribution over unknown model parameters, as it contains all relevant information about parameters after observing data to perform a reliable model selection . Formally, given data y and model parameters θ, Bayes rule gives the posterior distribution as that combines and actualizes prior knowledge on parameters (before seeing data) with knowledge acquired from observed data (through likelihood function). The denominator p(y) =´p(y | θ)p(θ)dθ represents the 4 For oral dosing, doses are added in the dosing compartment; infusion and IV bolus injections must be added directly in the central compartment. 5 Matrix exponential e tA is defined as e tA = ∑ ∞ k=0 (tA) k k! . Stan features a matrix exponential function matrix_exp.
probability of the data and it is known as evidence or marginal likelihood. In the context of inference, this term amounts to simply a normalization factor, thus equation (4) is reduced to a proportionality relation The Bayesian approach applied to PK analysis provides a fully probabilistic description of unknown quantities, allowing not only a straightforward interpretation of the inferred parameters and outcomes, but also the modeling of uncertainty about the inferred values of these quantities. Moreover, it provides us with a principled method for model comparison, selection, and decision-making by measuring the out-of-sample model predictive accuracy (i.e. the measure of the model's ability in new data prediction). This metric evaluates the generalizability of a model, as it provides an estimate of how well the model is likely to perform when used to make predictions on unseen data. To assess the out-of-sample predictive accuracy, we used Watanabe-Akaike (or widely available) information criterion (WAIC; . This metric is a fully Bayesian information criterion that utilizes the entire posterior distribution, thus allowing us to incorporate our prior knowledge into the model selection. Following , WAIC is given by: where lppd is the log pointwise predictive density for a new data point (as the accuracy term 6 ), and p eff is the effective number of parameters (as penalty term to adjust for overfitting) 7 . In practice, we can replace the expectations with the average over the draws from the full posterior to calculate WAIC (for more details see . Note that WAIC uses the whole posterior distribution rather than the point estimation used in the classical information criteria such as AIC and BIC. Finally, the relative difference in WAIC is used to measure the level or the strength of evidence for each candidate model under consideration. The lower value of WAIC indicates a better model fit. Following Burnham and Anderson (2002), Hashemi et al (2021), a relative difference larger than 10 between the best model (with minimum WAIC) and other candidate models indicates that an alternative model is very unlikely; that is, an alternative model has essentially no support.

MCMC sampling
Monte Carlo (MC) sampling is a family of computational algorithms used for uncertainty quantification by drawing random samples from distributions, in which the sampling process does not require knowledge of the entire distribution. Markov chain Monte Carlo (MCMC) methods construct Markov chains (sequences of probabilistically linked states, with the probability of transitioning to a given state depending only on the current state) that have a desired stationary probability distribution, to extract samples from this distribution . Using MCMC, the samples are obtained by retrieving the history of states visited by the chain; if the chain is run for long enough, these samples will originate from the stationary distribution (independent of the initial states). From these samples, we can construct MC estimates describing the target probability distribution, such as mean and quantiles; the sample average approximates an expectation with respect to the stationary distribution of the chain. In this context, Monte Carlo standard errors (MCSE) provide an indication of the quality of reported estimates, that is, a quantification of the estimation noise. The MCSE of an estimatorθ n is given by the posterior standard deviation divided by the square root of the effective sample size (ESS). In addition to sample mean and sample standard deviation, MCMC methods provide the estimated error (standard deviation) in the posterior mean estimation, on the scale of the parameter value (Carpenter et al 2017).

Adaptive HMC sampling using Stan
Stan (https://mc-stan.org) is an open-source statistical tool that implements automatic gradient-based algorithms for Bayesian modeling and probabilistic machine learning (Carpenter et al 2017). Hamiltonian Monte Carlo (HMC) is a powerful MCMC method that uses the derivatives of the density function being sampled to generate efficient transitions that explore the whole posterior distributions in complex probabilistic models (Duane et al 1987, Neal 2011. However, the performance of HMC is highly sensitive to the algorithmic parameters, such as the step size and the number of steps in the leapfrog integrator. Stan provides a self-tuning variant of the HMC, the so-called No U-Turn Sampler (Hoffman and Gelman 2014), which uses a recursive algorithm to eliminate the need for setting the hyperparameters, i.e. it adaptively tunes the HMC algorithm without requiring the user intervention Gelman 2014, Betancourt 2017). The NUTS algorithm avoids random walk behavior and sensitivity to correlated parameters to construct efficient Markov chain exploration of the distribution's typical set, that is, the set that concentrates the majority of the volume-density trade-off (Betancourt 2017). Starting from an initial value (random or defined by the user), the NUTS algorithm updates parameters θ through a series of iterations by evolving them according to Hamiltonian dynamics and submitting them to a Metropolis proposal. More precisely, HMC transforms the problem of sampling from a target distribution into the problem of simulating Hamiltonian dynamics; it artificially introduces an auxiliary momentum ρ to suppress random walk behavior, which extends the representation of the target parameter space into a phase space of joint parameters (ρ, θ). The auxiliary momentum is sampled from the conditional distribution π(ρ|θ). In Stan implementation, it is sampled from a multivariate normal distribution N that is independent of θ: with covariance matrix M estimated as the inverse of posterior covariance during a warm-up phase. The joint phase space parameters (ρ, θ) are evolved through Hamiltonian dynamics (the expression of the Hamiltonian is given in appendix A.1) by solving the Hamiltonian equations of motion (see appendix A.2). To do so, Stan uses the leapfrog integrator (see appendix A.3). The output of the integrator (ρ * , θ * ) is then submitted as a move proposal inside the Metropolis algorithm and is either accepted or rejected with a given probability and the system is evolved accordingly. The process is repeated iteratively starting with the momentum (re)sampling step (equation (7)).

Convergence of MCMC sampling
After running a class of MCMC algorithms, it is necessary to monitor the convergence of the samples. This can be carried out in different ways, including traceplots (showing the evolution of parameter estimates from MCMC draws over the iterations), pair plots (to identify collinearity between variables), and autocorrelation plots (to measure the degree of correlation between draws of MCMC samples). More quantitative metrics are also used in this study to assess the MCMC convergence, as described in the following.

Split-R
Sampling from a target distribution to estimate expectations can only be reliable if the chains have converged to the stationary distribution. The primary step to ensure the reliability of MCMC posterior chains is to check the convergence of chains with different random initializations. MCMC is asymptotically exact in the limit of infinite runs (N → ∞), however, in our finite-time settings, MCMC convergence cannot be guaranteed with a limited number of samples, thus, we have to rely on diagnostic quantities to assure convergence. Potential scale reduction statisticR (Gelman and Rubin 1992) is widely used in statistics to assess the convergence of multiple Markov chains involved in posterior sampling. TheR diagnostic provides an estimate of how much variance could be reduced by running the chains longer. TheR close to 1 (in practice, lower than 1.1) indicates good mixing of the chains; otherwise, the chains have not converged to the same stationary distribution. Computation ofR relies on between-(B) and within-(W) chain variances 8 . In particular, the marginal posterior variance is (over-) estimated as a combination of these two quantities var where N is the number of draws in each of the M chains. The potential scale reductionR is then defined aŝ Stan reports summary statistics includingR computed on the split-half chains, an additional precaution to detect non-stationarity in individual chains 9 .

Divergent transitions
The trajectories generated by numerical (leapfrog) integrator in the Hamiltonian equations may diverge, typically due to highly varying posterior curvature (such as funnel-shaped distributions; Betancourt 2017), then, the samples cannot be trusted. With symplectic integrators 10 , divergences build up very quickly, making them easy to detect. We have carefully monitored the Stan summary statistics for divergent transitions encountered during sampling.

ESS
A complication with MCMC methods is the autocorrelation within the generated samples of a chain, which is often caused by strong correlations among variables. Higher autocorrelation in chains increases the uncertainty of the estimation of posterior quantities of interest. Highly correlated MCMC samplers affect the efficiency of the sampler, as they require more samples to produce the same level of MC error for an estimate . The ESS provides an estimation of the independency of draws from posterior distributions that generated Markov chain would be equivalent to. Thus, a large enough ESS is an important factor for the reliability of the inferred quantities from posterior draws. We use the bulk-ESS and tail-ESS computed on rank-normalized posterior draws (Vehtari et al 2021) to assess the efficiency of sampling in the bulk (mean estimates efficiency), and tail (quantiles estimates efficiency) of the sampled posterior distribution, respectively. In the case of sampling with four individual chains (each with iter = 2000 and warmup = 1000), we consider a minimum ESS of 400 (Vehtari et al 2021).

Posterior predictive checks
Validating model assumptions and assessing the reliability of the model performance requires both domain expertise and the evaluation of the model's predictive performance. A posterior predictive check (PPC) is often used for model validation, i.e. generating data from the model using parameters drawn from the estimated posterior, and then comparing it with observed data. PPCs address such questions as does the fitted model and its estimated parameters generate data similar to those observed experimentally? Are the individual and population variability, such as the influence of covariates, consistently modeled? The principle of PPCs is to simulate multiple sets of data according to the predictive distribution y i ∼ y rep , i = 1, 2, . . . , N, and compare them to the observed data y, visually or quantified by summary statistics. Visual predictive checks usually depict the predicted mean and 95% confidence/credibility intervals (2.5% and 97.5% empirical quantiles of the posterior predictive draws) confronted with observed data. Note that PPCs are not predictive per se because we are not comparing predicted data to new observations, but rather we are comparing it to the same data that the model was fitted with. Such a process involves the double use of the data and should not be confused with validation prediction, which is evaluated on a separate validation dataset or unseen data (e.g. using WAIC). Systematic discrepancies between model predictions and data would indicate an inadequate model predictive power, and in this case, could be investigated to identify model failures.

Solving ODE systems in Stan
Linear ODE systems have the virtue of being analytically solvable, which not only provides an exact solution but is also very computationally efficient. The one-compartment linear ODE system (equation (1)) was solved analytically as provided by the Torsten library ready-to-use solver. Torsten package  is based on Stan software (v2.27.0) and provides functions that facilitate the analysis of pharmacometrics data . The library handles clinical event schedule data written in NONMEM (Beal et al 2009) conventions and the computation of steady-state dosing 11 . A steady-state is reached when the quantity of drug eliminated matches the quantity of drug that reaches the systemic circulation. In the context of a repeat-dose regimen, steady-state concentrations of drugs in the blood will rise and fall according to a repeating pattern as long as the dosing schedule remains unchanged. Torsten also offers specific functions for one-and two-compartment models for which it implements analytical solutions using matrix exponential in the Stan language. The model simulation and parameter estimation were performed on a Linux machine with 3.60 GHz Intel Core i7-7700 and 8 GB of memory. (B) One-compartment-simulated (in blue) and empirical (in red) data corresponding to plasma drug concentration (mg/L) for a single individual, and additional simulated data (in green) from the same dosing schedule.

Bayesian inference on synthetic data
In order to validate the Bayesian workflow, a synthetic dataset is generated from a one-compartment population model with first-order absorption, based on the same dosing schedule as the clinical setting, and with population PK parameters as CL pop = 7 (L/h), V pop = 65 (L), k apop = 2.5 (h −1 ). Inter-individual variations ω = (ω CL , ω V , ω ka ) are equally set to 0.3, and residual intra-individual variation σ is set to 0.1. An example of simulated drug concentration data compared to empirical observation for a single individual is shown in figure 1.
Bayesian inference of the unknown parameters of the underlying data generating process depends on both the observed responses and prior information about the underlying generative process. For the generated dataset, we aim to explore the actual sensitivity of the inference process to the level of information encoded in the prior. To do so, we consider two different prior specifications on population parameters CL pop , V pop and k apop , while the prior distributions placed on other parameters remain identical. Inter-individual variations are modeled by a variance-covariance matrix Ω, in which for computational efficiency in MC sampling, it is decomposed as with L (resp. L ′ ) as the Cholesky factor (resp. its transpose) of a correlation matrix R = LL ′ , and ω = (ω CL , ω V , ω ka ) as the vector of standard deviation of PK parameters. Here, L is assigned with a Lewandowski-Kurowicka-Joe (LKJ; Lewandowski et al 2009) prior distribution (a distribution over the set of correlation matrices or their Cholesky factor) with uniform density over all Cholesky factors of correlation matrices, whereas ω is assigned with a half-normal prior with zero mean and a standard deviation of 0.1: Individual parameters (normalised without covariate influence) denoted by Given the parameters θ i for the ith individual, the virtual amount of drug in both absorption and plasma compartments is computed by analytically solving the one-compartment ODE system given by equation (1). Here, we employed a ready-to-use one-compartment model solver provided by the Torsten package for event schedule specified data . Estimation of drug concentration in the blood for individual i, with unknown C i underlying observations C i [obs], is given by Given the above formulation, we compare two classes of a priori information on the population PK parameters. In the first set-up (denoted by model I), we place a weakly informative prior distribution centered around the true value of the parameters: Table 1. Summary statistics and convergence diagnostics for Bayesian inference on synthetic data (figure 1), using weakly informative priors (given by model I) and diffuse priors (given by model II), averaged over four HMC chains. Overall, the Bayesian model I with a higher level of information in prior, yields faster computation, better convergence diagnostics, and substantially higher prediction accuracy.
In the second set-up (denoted by model II), we place more diffuse (uninformative) priors, but still they are centered on true value for parameters V pop and k apop : To perform inference, prediction, and model comparison, we ran four individual chains with NUTS default setting in Stan (e.g. 2000 discarded warm-up iterations and 2000 saved sampling iterations, an expected acceptance probability of 0.85, and a maximum tree-depth of 10). In the following, we first carefully monitor the convergence of Markov chains, then we show the estimated posterior distributions and PPCs for population PK parameters. Finally, we demonstrate model comparison using WAIC to reliably select the model that best explains the data.

Sampling diagnostics
Sampling properties of both Bayesian models with weakly informative and diffuse priors (given by model I and model II, respectively) are overall satisfactory (see table 1). Neither of the models exhibited sampling pathologies such as divergent transitions or reaching maximum tree-depth. The values of the potential scale reduction factor areR < 1.1 for all parameters in both Bayesian models, indicating that the Markov chains have converged to a stationary distribution. When considering a more exigent threshold, thenR < 1.01 for all parameters of model I, except those relative to the 53th individual. More precisely, the individual parameter which partly controls the individual variation for the 53th individual and as a result biases the PK parameters and quantities (estimated and predicted concentration) for this individual. The Bayesian model II encounters more convergence issues with more individuals (16th, 44th, 53th, 64th). This model exhibits around 9% of its total 418 parameters that do not fall under the threshold, compared to 2.1% for the other model (see table 1). The Bayesian model I also exhibits a better convergence according to HMC specific diagnostics compared to model II (see table 1): the self-tuning step size of the leapfrog integrator is larger, and has fewer integration steps, resulting in faster computational time (around 1.5 order of magnitude), and more efficient sampling. Towards a principled Bayesian workflow (Gelman et al 2020), we investigate more closely the convergence of In the case of good mixing, all rank plots should be close to uniform (dashed horizontal lines). It can be seen that model I yields more uniform rank plots due to better mixing of the chains. the two fitted Bayesian models using the methods and quantities proposed by Vehtari et al (2021) to address possible flaws inR-based convergence diagnostics.

Mixing of the chains
Rank plots are an alternative approach to trace plots for visual sanity checking of convergence in posterior chains (Vehtari et al 2021) as shown in figure 2. For comparison to raw trace plots of the chains for the population PK parameters, see appendix figure B.1. Rank plots are histograms of posterior draws, ranked over all four chains, and plotted for each chain separately. If all chains are targeting the same posterior distribution, the rank histograms should not deviate significantly from uniformity, which is expected with a higher ESS. Figure 2 illustrates rank plots for population PK parameters using both Bayesian model I and model II. From this figure, we can see that a higher level of information in model I yields more uniform rank plots due to better mixing of the chains.

ESS
We have carefully monitored the ESS as an indicator of the number of independent draws, which affects the uncertainty of the estimation. We investigate the ESS for the three population PK parameters using the computation of split-R and ESS on rank-normalized (normal scores) samples, and the corresponding diagnostics proposed by Vehtari et al (2021). They have observed that the convergence of the chains is not necessarily uniform over the entire distribution of a parameter of interest. In addition to rank-normalized based ESS (or bulk-ESS), they proposed a tail-ESS to evaluate convergence in extreme quantiles.
In figure 3, we illustrate the checks of efficiency of quantiles and small-interval probabilities, as well as the efficiency evolution with the number of iterations. The Bayesian model I yields an overall larger ESS for the three population PK parameters, especially for parameter V pop . The Bayesian model I quantile estimates efficiency is comparatively improved compared to model II for parameter V pop , and high quantiles of k apop , the first being higher and uniform and the latter being increased in the right tail of the distribution. The parameter V pop in model II has noticeable small and decreasing efficiency in high quantiles. CL pop quantile estimates efficiency is similar in both models. In the case of a well-behaved sampling, the estimated ESSs should increase linearly with the number of iterations. This is the case for the population PK parameters in model I. However, in model II, the parameter V pop exhibits a problematic drop at half of the total iterations in tail-ESS, which indicates inter-chain scale discrepancy. Population PK parameters of both models reach an acceptable ESS over the recommended threshold of 400 when the number of draws exceeds 2000, except for pathological parameter V pop in model II. Together, these diagnostics validate that the samples in model I have better convergence to the target distribution.

Posterior behaviour
The prior and posterior distributions of model I and model II are illustrated in figure 4. It can be seen that, when using model I, the true values of all PK parameters (dashed vertical lines) are well within the support of the posterior densities, indicating that the Bayesian parameter recovery was successful. In contrast, the setting in model II failed to recover the parameters V pop and k apop . When using a wider set of priors (model II), the posterior distribution of V pop shifts towards values that are significantly lower than ground-truth, under over-influence of the data (over-fitting). A similar phenomenon is observed for the absorption parameter k a . The uncertainty of the estimation of posterior quantities (mean, MCSE of the mean, standard deviation, 95% credible intervals, and ESS) for the population PK parameters in both Bayesian models are summarised in table 2.
Visualizing the joint posterior samples in pair plots is especially useful for identifying collinearity between parameters as well as the presence of non-identifiability (banana-shaped distributions). As shown in Table 2. Comparison of sampled posterior distributions of population PK parameters using Bayesian model I and model II; sample mean, MCSE, sample standard deviation (SD), posterior 95% credible intervals and effective sample size (ESS bulk ). Parameters marked with a * have ESS below the limit of 400 recommended by Vehtari et al (2021  figure 5, model II exhibits a larger degenerate behavior in the form of a positive correlation (r > 0.7) between V pop and k apop posterior samples that is not seen in the more informative model I (r < 0.2). Such a high collinearity leads to an inefficient exploration of the posterior, which can be quantifiably observed in decreased numbers of effective samples and increasedR values. Moreover, by randomly initializing the chains, the estimation of posterior density of CL pop is more robust than that of the other two parameters. The robustness of the sampling of CL pop with respect to other parameters was also observed in the ESS plots ( figure 3). According to these results, the Bayesian set-up used in model I substantially improved the convergence of the recovery for population PK parameters.

PPCs
Posterior predictive (or retrodictive) checks are a widely-used tool for verifying the reliability of the fitted model by monitoring its predictive performance on the fitted dataset. Synthetic datasets (67 individuals, total of 427 observations) of plasma drug concentration are repeatedly drawn from the predictive distribution given by the fitted model (4 chains, each with 2000 draws). Punctual predictions (mean) and the corresponding 95% credible intervals are extracted from the draws and plotted along with the actual observations (see figure 6). For both Bayesian model I and model II, these visual checks suggest no systematic bias in the predictive distribution compared to the fitted data.

Residuals
Non-normalised residuals are calculated as differences between observed plasma concentration C obs and the estimated concentration computed by solving the one-compartment ODE system (see equation (1)), over the fitted individual PK parameters (CL i , V i , k ai ) and evaluated at observation points C obs . Monitoring the  residuals can help in pointing out issues with the estimation of PK parameters, an inadequate compartmental modeling, or ODE-solver inaccuracies. Histograms of non-normalised residuals for both model I and model II are given in appendix figure B.2. They are reasonably zero-centered (not biased), and there is no significant difference in the distributions of the non-normalised residuals of model I and model II. Note that the root mean squared error (RMSE, figure B.2, and table 1) as a within-sample predictive accuracy is slightly smaller in model II; this is consistent with the marginal posterior densities ( figure 4) where the interaction of the likelihood with more diffuse priors fails to recover the true parameters; hence, the resulting model fits the observed data even better, but at the cost of a poor representation of the true data generating process , Betancourt 2018.

Model selection
To select the best Bayesian model considered in this study for the synthetic dataset, we use the widely applicable information criterion (WAIC) as a measure of model evidence. WAIC is an indicator for the comparison of point-wise out-of-sample predictive accuracy of Bayesian models, based on the whole estimated posterior distribution. WAIC is here reported (see table 1) in the -2 log-score scale (deviance): a model with a smaller WAIC value suggests higher predictive accuracy, i.e. a better balance between accuracy and complexity, even though the number of parameters is equal for both candidate models. Our model comparison based on WAIC substantially favors model I (since ∆WAIC ≫ 10). Taken together, the Bayesian set-up used in model I substantially improved the convergence and model predictive power for estimation over population PK parameters.

Effect of noise on estimation
Lastly, we investigate the effect of the observation noise level on model parameter estimation. For varying levels of exponential residual noise, we examine the robustness of inference by monitoring HMC sampling diagnostics, PK parameter estimation, the within-sample accuracy of the fit (RMSE), and the out-of-sample predictive accuracy (WAIC). A sweep over noise values is depicted in figure B.3 for the best selected model (i.e. model I). We observed that lower residual noise levels (σ exp < 2) yield better estimation of posterior distributions of model parameters, a better RMSE between observed and fitted data, and enhanced model evidence according to WAIC measures. However, a large value of residual noise (σ exp ≥ 2) results in a significantly small predictive accuracy (i.e. the larger RMSE and WAIC values). This indicates that HMC sampling with the Bayesian set-up used in model I is robust when dealing with high levels of noise in observed data.

Out-of-distribution (OOD) PK parameters
The OOD parameters refer to situations where the observed data falls outside the range of typical or expected values. We generate biologically implausible data from unrealistic PK parameters (e.g. a volume of distribution V of 1 L), and we challenge the model's ability to infer from these data, which are unrealistic. Figure B.4 shows the predictive ability of model I measured using WAIC, when marginally changing the values of the population parameters in the data generation process. We observed a significantly lower WAIC value for plausible values of the parameters, resulting in better prediction accuracy than OOD parameters. This result indicates that the WAIC is able to effectively capture the underlying patterns and relationships in the data when the parameter values fall within a plausible range, thus establishing bounds or ranges in which the current model performs well.

Bayesian inference on clinical data
After calibrating models on simulated data, we fit model I on empirical Baclofen data. In particular, we use the set of population PK priors defined in model I. As in the previous section, we ran 4 chains of 2000 iterations each, with default settings in Stan (e.g. 2000 warm-up iterations, a target Metropolis acceptance rate of 0.85, and a maximum tree-depth of 10).

Sampling diagnostics
Stan diagnostics reported no warning on convergence, in particular, there is noR value above the threshold of 1.01, no diverging transition, and no quantity with insufficient ESS. We also rely on convergence diagnostics as shown in the investigation of models on simulated data: trace plots in figure C.1 show that the chains are mixing well; rank plots in figure C.2 are close to uniform; ESS (bulk and tail methods) in figure C.3 is large for the main three population PK parameters; paired samples of population PK parameters in figure C.4 exhibit no significant correlation.

Posterior distributions
The sampled posterior distributions of the three main population PK parameters and their respective inter-individual variability parameters are shown and compared to their prior distributions in figure 7. Statistical summary of posterior distributions of the population PK parameters is reported in table 3. We report the mean, MCSE of the mean, standard deviation, 95% credible intervals, and ESS. Interpretation of Bayesian posterior credible intervals is straightforward: a 95% CI is the central portion of the posterior that  concentrates 95% of sampled values; given the observed data, the parameter falls into the interval with probability of 95%. Distributions of sampled individual PK parameters CL i , V i and k ai are plotted for every individual in figure C.5.

PPCs
Predicted Baclofen plasma concentration datasets (N = 4 chains × 2000 draws) are generated from the estimated predictive distribution (see figure 8). Each shown data-point corresponds to an actual individual observation (in red). The PPCs are shown for all individual observations (in blue). For every individual observation, the corresponding punctual prediction (mean) and 95% confidence interval over the repeated simulations are also plotted (in light blue). No systematic deviation of predictive simulations from reality is visible, suggesting that the model's predictive distribution accurately captured the plasma concentration.

Residuals
Checking the residuals between the fitted ODE-solved estimated concentration at observation points C obs and actual observations C obs allows us to ensure that there is no bias in the estimation of the PK parameters and issues in the resolution of the underlying ODE system. The histogram of non-normalized residuals and the residuals along the time-axis are shown in figure 9, indicating that there is no biased estimation.

Covariates
Clinical data includes 18 biological covariates, which are detailed in Imbert et al (2015). The influence of a continuous covariate (denoted by COVAR) on individual parameter θ i is modeled by Searching for significant covariates on PK parameters was performed by evaluating repeated regressions (in linear and nonlinear forms of equation (15)) on HMC-fitted individual PK parameters. The statistical significance (p < 0.01) of regression coefficients was tested for covariates such as gender, body weight, age, height, creatinine serum, creatinine clearance, urea, alanine aminotransferase, aspartate aminotransferase, albumin, mean corpuscular volume, prothrombin ratio, fibrinogen, gamma-glutamyl transferase, alkaline phosphatase, carbohydrate-deficient transferrin, tobacco, and fagerstrom score. None exhibited significance, and as a result, no covariate was retained in the final model. This is consistent with the covariate selection that was performed with a NONMEM (Beal et al 2009) routine in Imbert et al (2015 and resulted in no selected covariate.

Prediction on individual dose regimen
The individualized estimation of PK model parameters allows for personalized prediction in customized dosing and temporal contexts. By placing the estimated posterior samples of the individual parameters into equation (3), we are able to predict the drug concentration in the blood under any time frame and different dose regimens, thus deriving reliable projections for optimizing the doses for individuals. Figure 10 shows examples of short-term posterior simulations for control subjects and Baclofen patients, under two different dose regimen. These simulations were obtained following a one-compartment modelling and using each individual's set of posterior samples of kinetic parameters. Note that in control cases, individual predictions

Discussion
Bayesian inference is a principled method for estimating the posterior distribution of unknown quantities given only observed responses and prior beliefs about unobserved hidden states (or latent variables). An advantage of using the Bayesian framework in the context of inference/prediction is the ability to generate not only a single point estimate (e.g. in the frequentist approach), but also full probability distributions for the quantities of interest (uncertainty quantification for decision-making process). From the latter, one can directly extract quantiles, with the possibility to answer questions such as 'what is the probability that the parameter of interest is greater/smaller than a specific value?' , with the confidence intervals in estimation. In addition, the propagation of uncertainty in the Bayesian framework provides a more robust and reliable predictive capability for the model under study, rather than point estimation with optimization methods. Importantly, the out-of-sample prediction accuracy (i.e. the measure of the model's ability in new data prediction e.g. using WAIC) enables reliable and efficient evaluation of potential hypotheses, as performed in this study. Several previous studies have used a scoring function (such as root mean square error or correlation) to measure the similarity between empirical and fitted data (Imbert et al 2015). The choice of scoring function can dramatically affect the ranking of model candidates, and, ultimately, the decision-making processes (see RMSE in table 1). Rather, we used non-parametric probabilistic mythology to analyze data, while various convergence diagnostics were carefully monitored to assess when the sampling procedure has converged to sampling from the target distribution.
Another advantage of the Bayesian approach used in this study is to integrate our prior information (domain expertise before seeing the data) into the inference process to better improve the model's ability to predict new data. For instance, it has been shown that measuring the prediction accuracy of a whole-brain network model with a higher level of information in the prior provides decisive evidence in favor of the true hypothesis regarding the degree of epileptogenicity, across different brain regions (Hashemi et al 2020, 2021. Accuracy of the predictive models is critical in determining the quality of their predictions for evidence-informed decision-making. It is important to note that the prior information is relevant in estimating the parameters rather than in calculating predictive accuracy . However, a substantial change in priors will affect the computations of the marginal likelihood, thus, the accuracy of the predictive models. An inappropriate choice of prior can lead to weak inferences and, consequently, poor predictions (see figure 4). On the other hand, a sufficient amount of information encoded through the prior distribution can provide decisive evidence in favor of the correct hypothesis, as shown in this study (see WAIC values in table 1). In addition to achieving a model with higher performance prediction accuracy, the appropriate priors can dramatically improve the exploration of the search space in terms of computational cost and algorithmic diagnostics, such as effective numbers of samples, rendering a more efficient inference process (see table 1, figures 2, 3 and 5).
MC methods allow us to sample from and, thus, approximate the exact posterior densities, without requiring knowledge of the whole distribution. HMC uses the gradient information of the posterior to avoid the undesired random walk of traditional sampling algorithms, thereby samples efficiently from posterior distributions with correlated parameters, particularly in high dimensional settings. Thanks to the open-source Stan programs (Carpenter et al 2017), adaptive HMC sampling refined the inference on population PK parameters, with the uncertainty in estimated parameters compared to point estimation in the previous work of Imbert et al (2015). In this study, we integrated domain expertise into inference process through the prior distribution to obtain the full posterior distribution that is well consistent with the data and domain expertise. For instance, prior information refined inference for the clearance parameter CL pop and its variability ω CL that are outside the 95% bootstrap confidence intervals compared to previous analysis. The posterior distributions of other inferred population PK parameters exhibit mean values that are within the previously proposed confidence intervals, and the credibility intervals extracted from their posterior distributions have been narrowed in comparison. Moreover, here the residual variance has been reduced by a factor of 10.
In this study, we estimated the effect of Baclofen on retrospective patients with AUD through Bayesian estimation of the parameters of a population PK model. Note that, retrospective studies do not involve the manipulation of variables, making it challenging to establish a cause-and-effect relationship between the exposure and the outcome. Although, they can provide important information about the potential associations between exposures and outcomes, but they cannot prove causation. This is because retrospective studies are subject to various sources of bias, including selection bias, confounding, and measurement bias. Despite these limitations, retrospective studies can still provide valuable information for generating hypotheses, guiding further research, and informing clinical decision-making. Using the prospective data, to establish a causal relationship between Baclofen's dose and exposure, and measuring the probability of cause and effect remains to be investigated in future studies.
The Bayesian approach, relying on automatic MCMC sampling algorithms used in this study provided accurate and reliable estimation of Baclofen effect, validated by posterior behavior analysis and various convergence diagnostics. This principled and probabilistic methodology enabled us to integrate the prior information in the explanation of observations to maximize the model prediction power for a given patient with AUD. This work offers proper guidance for the prediction of drug efficacy in clinical practice, bringing personalized medicine closer to reality in the treatment of brain disorders.

Data availability statement
The data cannot be made publicly available upon publication because they contain sensitive personal information. The data that support the findings of this study are available upon reasonable request from the authors.