Network Meta-Analysis: A Statistical Physics Perspective

Network meta-analysis (NMA) is a technique used in medical statistics to combine evidence from multiple medical trials. NMA defines an inference and information processing problem on a network of treatment options and trials connecting the treatments. We believe that statistical physics can offer useful ideas and tools for this area, including from the theory of complex networks, stochastic modelling and simulation techniques. The lack of a unique source that would allow physicists to learn about NMA effectively is a barrier to this. In this article we aim to present the `NMA problem' and existing approaches to it coherently and in a language accessible to statistical physicists. We also summarise existing points of contact between statistical physics and NMA, and describe our ideas of how physics might make a difference for NMA in the future. The overall goal of the article is to attract physicists to this interesting, timely and worthwhile field of research.


Introduction
Physicists in general, and statistical physicists in particular, have a propensity to draw inspiration from problems across the borders of traditional disciplines. The application of ideas and methods from physics to questions in biology, economics and the social sciences is therefore well established [1,2]. The following quote by the late Dietrich Stauffer encapsulates this [3]: 'The basic theorem of interdisciplinary research states: Physicists not only know everything; they know everything better.' Arguably, not all of these invasions into the territory of other disciplines are useful, and physicists have been criticised for their, at times, ill-informed attempts to address questions outside their area of expertise [4]. On the other hand, it is also hard to deny that physics approaches have made useful contributions to a number of different fields.
In this perspective review we highlight network meta-analysis (NMA), a topic from medical statistics, as a field for which we think physics ideas might be useful. Meta-analysis is a statistical technique used to combine the results of multiple trials [5-7]. The aim of such trials is to establish and compare how effective different treatment options are. To do this, the different treatments are administered to groups of subjects in medical trials. Individual trials often have small sample sizes and involve subjects taken from a restricted population. Because of this, it is desirable to systematically integrate results from different trials to obtain an overall estimate of the effect of a given treatment and to compare treatment options. This is complicated by the fact that trials taking place at different locations will generally involve demographically different subject groups. The aggregation of data from different trials is therefore not straightforward.
Conventional meta-analysis focuses on pairwise comparisons of treatments. More recently, however, NMA (also referred to as 'indirect meta-analysis', and 'multiple' or 'mixed treatment comparison' [8,9]) has emerged as a technique for making inferences about multiple competing treatments. NMA allows one to combine data from multiple trials even when different trials test different sets of treatment options. The term 'NMA' derives from a graphical representation of the treatments and trials. The nodes of the graph are the different treatment options, and the connecting edges represent comparisons made between the treatments; an illustration can be found in figure 1. NMA combines direct and indirect evidence for the assessment of treatments. This makes it possible to compare treatments that have not been tested together in any trial. For a textbook on NMA see [9].
NMA is based on two main concepts: microscopic models for the outcomes of the different trials in the graph, and algorithms or procedures to carry out the actual NMA inference. (We note that microscopic models are not actually called microscopic models in the NMA literature, but for a physics audience we think this is an appropriate description.) Microscopic models. The microscopic model captures the main assumptions made on the process leading to real-world trial outcomes. Each trial tests a subset of treatments. In the so-called 'random effects model' the relative treatment effects of the treatments in a particular trial are drawn from an underlying distribution. As a consequence, the effect of one treatment relative to another is not necessarily the same in two different trials. This reflects variability in local characteristics, for example, the fact that patient groups are chosen from different demographic subsets at different locations. Such individual or trial-level characteristics are known as 'effect modifiers' in the statistical literature.
In simulations of the random effects model, probabilities for the different health outcomes are then generated for each trial arm (the treatments tested in a trial are referred to as the 'arms' of the trial). One example is the rate of success (or failure) associated with the application of a treatment in a given arm. This generates multiple layers of randomness in simulations: random treatment effects, and a distribution of health outcomes among the patients in each arm. (In the case of binary outcomes, for example, this is a binomial distribution of successes and failures.)

Figure 1. The network for the 'thrombolytic drug' data set [10-12] comparing nine treatments for acute myocardial infarction (heart attack). The treatments T_1, ..., T_9 are labelled in the box. They consist of eight thrombolytic drugs and one angioplasty intervention (T_7). The thickness of an edge indicates the number of trials making that comparison, and the area of a node is proportional to the number of patients allocated to that treatment. The network consists of 50 trials; two three-arm trials (comparing T_1, T_3, T_4 and T_1, T_2, T_9, respectively) and 48 two-arm trials. The multi-arm trials are not explicitly indicated on the graph.

NMA inference. The purpose of NMA is to estimate model parameters from given trial outcomes. These can either be real-world data or synthetic data (i.e. data generated in simulation studies). The NMA process also provides confidence levels for these parameters. These can be used to construct a 'ranking' of treatments, as best, second best and so on. In more sophisticated approaches, probabilities are assigned that each treatment has a particular rank, reflecting the uncertainty on inferred treatment effects. Different ranking methods are still very much under discussion.
The NMA inference itself can either be carried out in a frequentist or a Bayesian setting. In this paper we will describe the main ideas of both approaches. In Bayesian NMA (section 3) prior distributions are assumed for key model parameters, and posterior distributions are constructed from these and the trial outcomes. This needs to be done numerically, using Markov chain Monte Carlo (MCMC) methods (in NMA specifically, the Metropolis-in-Gibbs algorithm is often used [13]). In the frequentist approach one defines a linear regression model for the model parameters. The model can then be fitted using generalised least squares (GLS) regression or a maximum-likelihood approach, as explained in more detail in section 4. Other procedures are also possible [14,15], but are not discussed here.
We believe that NMA has a natural appeal to statistical physicists. Those with experience in complex networks will find it interesting to connect the structure of treatment-trial networks with the outcome of NMA. Computational physicists may contribute to optimising the inference process and required sampling methods. Those interested in stochastic simulations can naturally connect with data generation methods used to obtain synthetic data for a given network of treatments and trials. NMA is an inference problem on a networked structure, and we expect that physicists working at the border to computer science and machine learning will become excited about it; for example, it is conceivable that message passing methods can become a useful tool for NMA. Our own work (with collaborators) shows that random walks on the meta-analytic network and related graphs can lead to additional insights and improve methods to establish how evidence flows in NMA [16].
One main bottleneck appears to be that there is no unique source which would allow a physicist to enter this field efficiently. While textbooks and review articles on NMA exist [8, 9, 17-24], these are often written for medical practitioners, or users of existing software packages. The mathematical details are frequently suppressed, or not presented in a language physicists are used to. This can make it hard to get a good grip on the actual mechanics of NMA. This perspective review is our attempt at rectifying this. Our objective is to provide a technical introduction to NMA, accessible to physicists. We have aimed to make this self-contained, but at the same time we have tried to keep the length to a reasonable limit. We hope we have found a sensible middle ground. We necessarily had to make a selection of topics we can cover, and attempted to choose those that are most helpful for others entering this area. We also aim to point out ideas from physics which we believe to be most promising to make a difference to NMA. We hope that this will facilitate future work by the physics community in this timely and worthwhile area of research.
The paper is organised as follows: section 2 sets the scene, defines the necessary notation and states the 'NMA problem'. In sections 3 and 4 we then present the mathematical procedures used to carry out an NMA in a Bayesian and frequentist setting respectively. Section 5 summarises how the results of an NMA are reported. In section 6 we present existing analogies connecting NMA to different systems in physics, including resistor networks and random walks. In section 7 we then outline some more general connections between the two fields and speculate on ways in which we think physicists may contribute to NMA in the future. Section 8 contains a brief summary and discussion.

Networks of medical trials 2.1. General background: randomised controlled trials, meta-analysis, and network meta-analysis
In this section we first give an informal description of the key concepts in NMA. We turn to a more formal mathematical setup in section 2.4.

J. Stat. Mech. (2022) 11R001
2.1.1. Randomised controlled trials. For our purposes a trial is an experiment in which a group of subjects (patients) is used to compare a given set of treatment options. The different treatments are referred to as the arms of the trial. In particular, a 'controlled' clinical trial is one with at least two arms. Typically, this involves one or more 'experimental' treatment groups representing new treatments being tested. These are compared to the so-called 'control' group(s), which could be alternative (existing) treatments, a placebo or no treatment [25].
The allocation of subjects to the different arms is randomised to ensure that individual-level effect modifiers will be balanced across the treatments. Individual-level effect modifiers are patient characteristics that might explicitly or implicitly modify the effect a treatment has on a patient, such as age, employment status, or gender. The treatment assigned to a given subject can, for example, be chosen with equal probability from the arms of the trial. In this scenario the trial is a 'randomised controlled trial' (RCT). The randomisation process makes RCTs preferable to so-called 'observational studies' in which treatment assignment is not controlled by the researchers. In these more 'realistic' settings, patients are likely to be prescribed treatments based on characteristics such as disease severity, age or medical history. This means that effect modifiers are not equally distributed between trial arms which in turn can cause bias when trying to determine the treatment effects.
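A toy sketch of this equal-probability assignment (the treatment labels are hypothetical, and real trials use more sophisticated randomisation schemes):

```python
import random

def allocate(n_subjects, arms, seed=0):
    """Simple randomisation: assign each subject to one of the
    trial arms with equal probability, independently."""
    rng = random.Random(seed)
    return [rng.choice(arms) for _ in range(n_subjects)]

# With 1000 subjects and two arms, each arm receives roughly half.
assignment = allocate(1000, ["T1", "T2"])
```

On average, individual-level effect modifiers (age, gender, etc) are then balanced between the two arms, which is the point of the procedure.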

2.1.2. Trial outcomes.
Once assigned to a trial arm, each subject receives the respective treatment, and undergoes follow-up. Depending on the clinical question of interest, specific health outcomes are recorded. These outcomes could reflect the efficacy of a treatment for example in reducing the symptoms of a condition, curing a disease or prolonging life. Alternatively one may be interested primarily in the safety or 'acceptability' of a treatment. Outcomes may then reflect the side effects or burden that the treatment intervention places on patients. Due to the wide range of clinical applications, the type of outcome data reported by trials can take many forms. Common examples include time-to-event data, so-called categorical outcomes (discrete outcomes from a set of possible values) and continuous measurements (such as blood pressure or body-mass index) [17].
The simplest type of outcome is a binary (dichotomous) variable. We can also think of this as 'an event has occurred' vs 'no event has occurred'. For the purposes of this introductory review we will focus on this case of binary outcomes. We hope our introduction helps the interested reader to study generalisations later on if required. For simplicity, we will have a (somewhat stylised) picture in mind: the treatment either 'works' or it 'does not work'. We will refer to the probability that the treatment works as its 'effectiveness'. The formalism for binary outcomes is obviously independent of this interpretation, and applies just as well if the outcomes reflect, for example, safety (occurrence or non-occurrence of adverse effects). We will also exclude other complications such as censoring (e.g. patients withdrawing from the trial or otherwise not being followed up, and therefore not producing data).

2.1.3. Meta-analysis.
Meta-analysis in general is concerned with combining evidence from multiple trials. The simplest case is 'pairwise' or 'standard' meta-analysis. Two treatment options are compared in a set of different trials. The purpose of meta-analysis is then to 'integrate' the outcomes of the different trials, and to estimate how effective the competing treatment options are. These estimates can then be used to decide if one of the two treatments is to be preferred over the other, and which one.
In this process it is important to bear in mind that the outcomes of different trials cannot always be aggregated directly. That is, we cannot simply use the results across the different trials as if all patients were part of one large study. Clinical trials taking place at different locations will draw from a local patient pool, and as a result, the general characteristics of the subjects may differ from trial to trial (e.g. age, health or economic status, level of education, etc). These effect modifiers may change the observed treatment effects.
In order to combine evidence from multiple trials we require an underlying model, that is, a stochastic process with unknown parameters which takes into account possible trial-to-trial variation of effects, and which leads to realisations of the data observed in trials. Two of the most common modelling approaches, fixed and random effects models, are discussed in section 2.4. Once a given model assumption has been made, the objective of meta-analysis is to estimate the parameters of the model from the data.

2.1.4. Networks of trials and network meta-analysis. General networks of treatments and trials capture more complex situations than the pairwise situation described in section 2.1.3. For example, imagine that there are four different treatment options and several trials, each comparing a subset of treatments. Not every trial tests all four treatments, but the same pairwise comparison is perhaps made in different trials. This generates a network of treatment options and trials.
A possible scenario is illustrated in figure 2. The network consists of three trials (indicated by square boxes), comparing different subsets of four different treatments (figure 2(a)). We label the treatments T_1, ..., T_4, and indicate them as circles on the graph. Most notably in this example, there is no pairwise comparison between two fixed treatments that is made in all three trials (i.e. no pair of treatment arms features in all three trials). The pair T_2-T_3 appears in trials 1 and 2, so the use of conventional (pairwise) meta-analysis would be restricted to combining information regarding this particular pair from trials 1 and 2 only.
NMA aims to integrate further information from the network. The information about the pair T 2 -T 3 from trials 1 and 2 is referred to as direct evidence for the comparison of these two treatments. However, T 2 and T 3 are each also compared to T 1 . For treatment option T 2 this happens in trial 1, and for T 3 in trials 1 and 3. These comparisons to a common third treatment provide indirect evidence for the comparison of T 2 and T 3 .
In the example, treatment options T 2 and T 4 are not compared directly in any trial. However, each of the two are directly compared to T 1 and T 3 (see figure 2(b)). This indirect evidence can be used to infer information about the comparison between T 2 and T 4 , even though there is no direct evidence.
Data example. Figure 1 shows the network for a real NMA data set comparing nine treatments for acute myocardial infarction (heart attack) [10-12]. The treatments are labelled T_1, ..., T_9 and their names are given in the figure. They consist of eight thrombolytic drugs and one angioplasty intervention (T_7). The data has therefore been referred to as the 'thrombolytic drug' data set [12,27]. A heart attack occurs when blood flow to the heart is cut off, usually resulting from blockage of one or more of the coronary arteries. Thrombolytic drugs aim to dissolve blood clots that have blocked arteries, whereas angioplasty is a procedure that tries to relieve blockage by widening the arteries. The data consists of 50 trials; two three-arm trials (comparing T_1, T_3, T_4 and T_1, T_2, T_9, respectively) and 48 two-arm trials. Each trial records the number of deaths in each treatment arm that occur within 30 or 35 days of a heart attack. The network is represented by a weighted graph in which, for each pair of treatments, the thickness of the connecting edge is proportional to the number of trials comparing these two treatments. The area of each node is proportional to the number of patients allocated to the treatment represented by that node. Multi-arm trials are not explicitly indicated on the graph.

Figure 2. (a) The network consists of three trials (squares), and four treatments in total (circles). (b) Trial 1 has three arms (treatments T_1, T_2 and T_3), trial 2 is two-armed (treatments T_2 and T_3), and trial 3 tests treatments T_1, T_3 and T_4. (c) Presentation of the network as a graph with only one type of node. Each node represents one treatment, and two treatments are connected if they have been directly compared in at least one trial. The thickness of the edge connecting two nodes is proportional to the number of trials comparing those two treatments. This representation does not contain full information about the network of treatments and trials. (d) Representation as a bipartite graph of treatments and trials. This can also be understood as a hypergraph (related concepts include incidence or Levi graphs [26]).
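The edge weights of the one-node-type graph described here are easy to tabulate from the arm lists of the trials. The following sketch uses a small illustrative set of trials, not the actual 50-trial data set:

```python
from itertools import combinations
from collections import Counter

# Each trial is represented by the tuple of treatments it compares
# (its arms). Illustrative subset only -- not the real data.
trials = [("T1", "T3", "T4"), ("T1", "T2", "T9"),
          ("T1", "T2"), ("T1", "T2"), ("T3", "T4")]

# Edge weight between two treatments = number of trials in which
# they are directly compared (a 3-arm trial contributes 3 edges).
edge_weight = Counter()
for arms in trials:
    for pair in combinations(sorted(arms), 2):
        edge_weight[pair] += 1

print(edge_weight[("T1", "T2")])  # 3: two 2-arm trials plus one 3-arm trial
```

As the comment notes, multi-arm trials contribute to several edges at once, which is why this weighted graph does not carry the full information about the network of treatments and trials.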

2.2. General notation for networks of trials and treatments
We write N for the total number of treatment options in the network. We label the different treatments T_1, T_2, ..., T_N, and we will use the indices a, b when we refer to elements of the set of treatments, i.e. a, b ∈ {T_1, ..., T_N}. The number of trials in the network is denoted by M, and we use the indices i, j to refer to the different trials, i.e. i, j ∈ {1, ..., M}. We will use the words 'trial' and 'study' synonymously.
Each trial compares a subset of treatments. We write A_i ⊂ {T_1, ..., T_N} for the set of treatment options compared in trial i. Hence m_i ≡ |A_i| is the number of arms of study i. We number the arms of trial i by ℓ = 1, ..., m_i, and denote the treatment given to patients in arm ℓ of trial i by t_{i,ℓ}. Each t_{i,ℓ} (i = 1, ..., M, ℓ = 1, ..., m_i) is a treatment from the set {T_1, ..., T_N}.
We write n_{i,ℓ} for the number of subjects receiving the ℓth treatment in trial i, with ℓ = 1, ..., m_i. Focusing on binary outcomes, the data available for each patient is whether an 'event' has occurred by the end of the study or not. (As noted earlier in section 2.1.2, depending on the context, an event can either be treatment success or an adverse event. The latter is the case in the data example in section 2.1.4.) For each arm ℓ of trial i, the number of resulting events is recorded. We denote this by r_{i,ℓ}. This quantity takes integer values in the range 0, 1, ..., n_{i,ℓ}.
Summarising, trial i is defined by the treatments it compares, t_{i,1}, ..., t_{i,m_i}, by the number of patients in each arm, n_{i,1}, ..., n_{i,m_i}, and by the number of events in each arm, r_{i,1}, ..., r_{i,m_i}.
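In code, this per-trial summary (the t_{i,ℓ}, n_{i,ℓ} and r_{i,ℓ}) can be held in a small container; a sketch with hypothetical numbers:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Trial:
    """Arm-level summary of one trial: treatments t_{i,l},
    arm sizes n_{i,l} and event counts r_{i,l}."""
    treatments: tuple   # (t_{i,1}, ..., t_{i,m_i})
    n: tuple            # number of patients per arm
    r: tuple            # number of events per arm

    @property
    def m(self):        # number of arms, m_i = |A_i|
        return len(self.treatments)

# A hypothetical three-arm trial:
trial = Trial(treatments=("T1", "T3", "T4"), n=(120, 115, 130), r=(30, 22, 41))
```

A network of trials is then simply a list of such objects, from which the set of treatments and the comparison structure can be recovered.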

2.3. Absolute and relative treatment effects
2.3.1. Absolute treatment effects.
As explained in section 2.1.2, trial outcomes come in many forms, i.e. they could be binary, or the outcomes could take one of multiple discrete values, or indeed be continuous. Depending on the type of outcome data reported in the trials we must make an assumption about how this data is generated in a mechanistic model.
We generally assume that the distribution from which the outcomes are drawn in arm ℓ of trial i is determined by a parameter λ_{i,ℓ}. In a model with binary outcomes the λ_{i,ℓ} determine the probabilities with which a patient experiences an event in the different trial arms. We will refer to the λ_{i,ℓ} as the 'absolute treatment effects'. The word 'absolute' here indicates that these treatment effects are not measured in comparison with any other treatment. At the end of this section we will briefly discuss the choice of the word 'effect' further.
It is important to distinguish between true (but unknown) parameters and estimates of these parameters obtained from trial data. The λ i, and the relative treatment effects (introduced below) describe the 'true' underlying effects of the treatments in the trials. They are not known to those who conduct the trial, or those who analyse the data in an NMA. In a simulation these parameters would have to be generated from a model, and they can then be used to generate synthetic trial data. Parameters that are estimated in an NMA will be labelled with a hat (ˆ) to distinguish them from (true) model parameters.
Binary outcomes. For the example of binary data, we assume that the application of the treatment in arm ℓ of trial i generates events with probability p_{i,ℓ}, independently for each of the n_{i,ℓ} patients in this trial arm [28]. As a consequence, the number of events r_{i,ℓ} in arm ℓ of trial i is a binomial random variable,

r_{i,ℓ} ∼ Bin(n_{i,ℓ}, p_{i,ℓ}),    (1)

for i = 1, ..., M and ℓ = 1, ..., m_i. The {p_{i,ℓ}} are then a measure of the absolute treatment effects; they capture how likely it is that the different treatments produce 'events'. For other types of outcome data we replace equation (1) with the relevant distribution. A brief remark is required here: while p_{i,ℓ} is the probability that a patient in arm ℓ of trial i experiences an event, we do not make any statement about what the event probability would have been if the patient had received no treatment. The quantity r_{i,ℓ} counts all events in the trial arm, not only those specifically 'caused' by the treatment. Despite this caveat (which should always be kept in mind), we find it useful to refer to the p_{i,ℓ} as (absolute) treatment effects.
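Under this binomial assumption, synthetic event counts for the arms of a single trial can be drawn as follows (the arm sizes and probabilities are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

# Arm sizes n_{i,l} and event probabilities p_{i,l} for one two-arm trial
n = np.array([100, 100])
p = np.array([0.3, 0.5])

# Event counts r_{i,l} ~ Bin(n_{i,l}, p_{i,l}), drawn independently per arm
r = rng.binomial(n, p)
```

Each entry of `r` lies between 0 and the corresponding arm size; repeating the draw mimics the sampling noise present in real trial data.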

2.3.2. Link functions.
In the example of binary outcomes, the parameter p_{i,ℓ} is a probability, and thus restricted to the interval between 0 and 1. Similar constraints on the range of values for λ_{i,ℓ} may apply for models with other types of outcomes.
For modelling and analysis purposes it is often more convenient to work with parameters taking values from the entire real axis. For example, we will later assume that the distribution of effects across trials is Gaussian (section 2.4.2).
One therefore carries out a transformation via a so-called 'link function'. We cannot summarise here the full theory behind the choice of link functions (or, more specifically, of so-called 'canonical link functions'); instead we refer to [17,29,30] for an overview. Here we only describe the transformation for binary data, which involves expressing the absolute treatment effects in terms of so-called 'log-odds'.
For a probability p ∈ [0, 1] the term 'odds' refers to the ratio p/(1 − p), and the log-odds or logit ('logistic unit') is defined as

logit(p) = ln[p/(1 − p)].    (2)

While the original probability p is restricted to the range 0 ≤ p ≤ 1, the logit of p can take values on the entire real axis, with lim_{p→0} logit(p) = −∞ and lim_{p→1} logit(p) = ∞. For binary data one makes the following choice for the transformation between the λ_{i,ℓ} and the event probabilities p_{i,ℓ}:

λ_{i,ℓ} = logit(p_{i,ℓ}).    (3)

The logit function is the canonical link function for binomial data. For other treatment outcomes, different (canonical) link functions would be used to transform the parameter of interest onto a continuous scale on the real axis [17].
In the context of binary data, and in a slight abuse of terminology, we will refer to both the {λ_{i,ℓ}} and the {p_{i,ℓ}} as the absolute treatment effects. From the context it will be clear what we mean. We note that exchanging the definitions of event versus no event (i.e. p ↔ 1 − p) only results in a sign reversal, i.e. it makes no difference for the mathematical model and inference process. This is not true for some other choices of the link function [31,32].
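The logit link and its inverse, and the sign-reversal property just mentioned, can be checked in a few lines:

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def expit(lam):
    """Inverse link: recover the probability p from the log-odds."""
    return 1.0 / (1.0 + math.exp(-lam))

assert abs(logit(0.5)) < 1e-12                   # even odds map to zero
assert abs(expit(logit(0.73)) - 0.73) < 1e-12    # round trip
assert abs(logit(0.2) + logit(0.8)) < 1e-12      # p <-> 1-p flips the sign
```

The inverse link (`expit`, the logistic function) is what a simulation uses to turn model parameters on the real axis back into event probabilities.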

2.3.3. Relative treatment effects, transitivity and common baseline at trial level. We next introduce the relative effect between treatments a and b in trial i. It is important to remember that (within the model we are about to set up) these describe the true underlying effects, not effects estimated from data. For binary outcomes, we write

Δ_{i,ab} = logit(p_{i,b}) − logit(p_{i,a})    (4)

for the effect of treatment b relative to a in trial i. This is the logarithm of the ratio of the odds of the two treatments, the so-called 'log-odds ratio' (LOR). The trial-specific relative treatment effects fulfill the transitivity relation

Δ_{i,ab} = Δ_{i,cb} − Δ_{i,ca}    (5)

for all treatments a, b and c in trial i. We note that equation (5) follows from the definition in equation (4). The model setup as a whole therefore assumes transitivity of relative effects at trial level. In our model there are m_i arms in trial i, labelled ℓ = 1, ..., m_i. Without loss of generality we can designate ℓ = 1 to be the trial-specific baseline treatment. Using the transitivity relation in equation (5) it is then sufficient to specify the absolute effect p_{i,1} of the baseline treatment, and the treatment effects of the m_i − 1 remaining arms in the trial relative to this baseline. We have

Δ_{i,1ℓ} = λ_{i,ℓ} − λ_{i,1},  ℓ = 2, ..., m_i,    (6)

for the effect of treatment t_{i,ℓ} relative to this baseline. While we here focus on the case of binary outcomes and the logit link function, we re-iterate that transitivity, the introduction of trial-specific baseline treatments and relations of the type Δ_{i,1ℓ} = λ_{i,ℓ} − λ_{i,1} carry over to models with more general trial outcomes and the corresponding link functions.
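A quick numerical check that the log-odds ratio obeys trial-level transitivity by construction (the event probabilities below are hypothetical):

```python
import math

def lor(p_a, p_b):
    """Log-odds ratio: effect of treatment b relative to a."""
    odds = lambda p: p / (1.0 - p)
    return math.log(odds(p_b) / odds(p_a))

p = {"a": 0.2, "b": 0.35, "c": 0.5}   # hypothetical event probabilities
# Transitivity holds identically, since each LOR is a difference of logits:
assert abs(lor(p["a"], p["b"]) - (lor(p["c"], p["b"]) - lor(p["c"], p["a"]))) < 1e-12
```

Because the LOR is a difference of logits, transitivity at trial level is an identity rather than an additional assumption, as noted in the text.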

2.4. Fixed and random effects models
In this section we describe the random and fixed effects models used in meta-analysis. We will abbreviate these as RE and FE models respectively. The FE model is a limiting case of the RE model.

2.4.1. General idea: modelling relative effects.
It is likely that the absolute treatment effect of any given treatment will vary between trials due to effect modifiers specific to the individual trials (e.g. different patient cohorts, differences in treatment dosage or administration). The most common approach to NMA is therefore to model the relative rather than absolute treatment effects. One here assumes that differences between the effects of treatments are similar across the trials. This is less restrictive than making such an assumption for absolute effects.
To do this, each trial i comparing treatments a and b is associated with some unknown relative treatment effect Δ_{i,ab}, which represents the 'true' difference in effectiveness between a and b in trial i. These are the quantities introduced in section 2.3.3. The FE and RE models differ in the assumptions placed on these 'trial-specific' relative effects.
The FE model assumes that the underlying relative treatment effect for two treatments a and b is the same for all trials involving those two treatments, i.e. Δ_{i,ab} ≡ d_{ab} for all i, for some value d_{ab} which does not depend on i.
The RE model on the other hand is more flexible. Rather than requiring that the trial-specific relative effects are the same in every trial, the true relative treatment effects in the different trials are treated as random variables drawn from a probability distribution [7]. One here assumes 'exchangeability', that is to say, for two treatments a and b the relative effects Δ i,ab for all trials i involving a and b are drawn independently and from the same distribution [5,33,34]. Variation in the relative effects of treatments between trials is referred to as between-trial 'heterogeneity'.
The RE model therefore consists of two levels of randomness; one due to variations between trials and the other due to sampling within a given trial. In the FE model, we only allow for the latter.

2.4.2. Model definitions.
We now formalise the RE model, and in particular its two levels of randomness.
First level of randomness: between-trial variation. We assume that the relative treatment effects for a given trial i are drawn from a multivariate normal distribution,

(Δ_{i,12}, ..., Δ_{i,1m_i})^T ∼ N(d_i, Σ_i).    (7)

The first argument, d_i = (d_{t_{i,1}t_{i,2}}, ..., d_{t_{i,1}t_{i,m_i}})^T, is the mean, and the second, Σ_i, is the covariance matrix. The assumption of a Gaussian distribution underlines again the need to work with variables taking values across the entire real axis and to apply a link function as described earlier in section 2.3.2. We will now describe these first and second moments in more detail.

First moment: mean relative treatment effects. In equation (7), the relative effect between two treatments a and b in a given trial is a Gaussian random variable. In particular, the relative effect between these treatments varies across different trials involving a and b. The model parameters d_{ab} are the averages of these random numbers. For a given pair a and b the parameter d_{ab} can be interpreted as the 'typical' relative effect one should expect to see in a trial involving these treatments.
We adopt the convention d_{ba} = −d_{ab} and d_{aa} = 0. We further assume that for any treatments a, b and c the relative treatment effects d_{ab}, d_{cb} and d_{ca} fulfill the transitivity relation

d_{ab} = d_{cb} − d_{ca}.    (8)

We re-iterate that transitivity is an assumption made about the true treatment effects in setting up the model. It is applicable to all possible comparisons in the network.
(As an aside we mention that transitivity is to be distinguished from 'consistency'. The latter is the 'statistical manifestation' [19] of transitivity in observed data. That is, if direct and indirect evidence exist in the data for a particular comparison, then the data is consistent if there is no discrepancy between the treatment effects obtained via the two types of evidence.) Using transitivity, the relative treatment effects of all pairs in a network of N treatments, T_1, ..., T_N, are fully specified by N − 1 numbers. For example, we can designate treatment T_1 as the overall 'global' (network-wide) baseline. It is then sufficient to know d_{T_1 a} for a ∈ {T_2, ..., T_N}. These values are termed the 'basic parameters' [35,36] and we collect them in the vector d = (d_{T_1 T_2}, ..., d_{T_1 T_N})^T (the superscript T indicates transposition). The relative treatment effect d_{ab} for any pair of treatments a, b can then be determined from the linear relation in equation (8) using c = T_1.
In particular, we can also write each entry in the vector of mean treatment effects in trial i, d_i = (d_{t_{i,1} t_{i,2}}, . . . , d_{t_{i,1} t_{i,m_i}})^⊤, as a linear combination of the basic parameters d. We therefore have d_i = X_i d, with a suitable matrix X_i. We note that d is a vector with N − 1 entries and that d_i has m_i − 1 components. The (m_i − 1) × (N − 1) matrix X_i describes which treatments are compared in trial i and is called the 'design matrix' of the trial. Each of the N − 1 columns of X_i represents a treatment a ∈ {T_2, . . . , T_N}. The m_i − 1 rows represent the comparisons of treatments t_{i,ℓ} (ℓ = 2, . . . , m_i) to the trial-specific baseline t_{i,1}. Details of the construction of the design matrix can be found in appendix A, along with an example.
Second moment: heterogeneity. The (m_i − 1) × (m_i − 1) matrix Σ_i in equation (7) describes the variance of the relative treatment effects, Δ_{i,1ℓ} (ℓ = 2, . . . , m_i), and their correlations. Following [37][38][39], we will assume that its diagonal elements are all identical. We write τ^2 for their common value. This is the variance of each Δ_{i,1ℓ}. We will further assume that the covariance between any two treatment effects is τ^2/2 (these are the off-diagonal elements of Σ_i). This ensures that the relative effect Δ_{i,1ℓ} − Δ_{i,1ℓ′} between any two treatments ℓ ≠ ℓ′ in trial i has variance τ^2. 6 The between-trial variance τ^2 is termed the heterogeneity variance. We will refer to its square root, τ, as the heterogeneity parameter. Occasionally we will simply use 'heterogeneity' to refer to either the parameter or the variance but it should be clear from the context what we mean. Usually this distinction is not important.
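As a quick numerical check of this covariance structure, the sketch below (plain Python, with illustrative values for τ and m_i) builds Σ_i and verifies that the implied variance of the difference between two relative effects is indeed τ^2:

```python
def heterogeneity_matrix(m_i, tau):
    """(m_i - 1) x (m_i - 1) covariance matrix Sigma_i of the relative
    effects in an m_i-arm trial: tau^2 on the diagonal, tau^2/2 elsewhere."""
    dim = m_i - 1
    return [[tau**2 if r == c else tau**2 / 2.0 for c in range(dim)]
            for r in range(dim)]

tau = 0.4
Sigma = heterogeneity_matrix(4, tau)   # a four-arm trial, so a 3 x 3 matrix

# Var(X - Y) = Var(X) + Var(Y) - 2 Cov(X, Y) for any two relative effects
var_diff = Sigma[0][0] + Sigma[1][1] - 2.0 * Sigma[0][1]
print(var_diff == tau**2)   # True: any pairwise contrast has variance tau^2
```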
Second level of randomness: sampling noise in a given trial. In a second level of randomness the model assumes that the application of the treatment in arm ℓ of trial i generates events with probability p_{i,ℓ}, independently for each of the n_{i,ℓ} patients at the end of this trial arm [28]. Each r_{i,ℓ} is then a binomial random variable, as described in equation (1). The setup for binary outcomes can be generalised to more complex situations in which outcomes can take more than two discrete values, or where they are continuous.
The RE model (for binary outcomes) is summarised and illustrated in figure 3. (Caption of figure 3: In a first step of randomness, realisations of the random variables describing the treatment effects in the different trials (p_{i,1} and Δ_{i,12}, . . . , Δ_{i,1m_i}) are drawn for each trial from the distribution in equation (7), supplemented by a distribution for each p_{i,1}. These are then used along with equation (6) to construct the absolute effects of the treatments in each trial. From these, and using the numbers of participants (the {n_{i,ℓ}}), the numbers of events in each arm (the {r_{i,ℓ}}) are then drawn from the binomial distributions in equation (1). The fixed effect model is the special case τ = 0; the distribution in equation (7) then turns into a delta-distribution, and the true relative effect between two treatments a and b does not vary between trials and is given by d_ab.) The FE model is simply a special case of the RE model where τ = 0. As previously explained, for any fixed pair of treatments, there is then no variation in the relative treatment effects between trials.

Generation of synthetic data in simulations
The different levels of randomness in the model can be understood by thinking about how one would simulate synthetic trial data in line with the model assumptions. We describe this for binary outcomes.
The process begins by defining the fixed parameters of the network. First, we pick the network configuration: the number of treatments N, the total number of studies M, the number of arms in each trial {m_i}, the set of treatments these arms relate to {t_{i,ℓ}}, and the number of participants in each arm {n_{i,ℓ}}. We then assign the 'true' values of the model parameters, i.e. of d = (d_{T_1 T_2}, . . . , d_{T_1 T_N})^⊤ and of the heterogeneity parameter τ. Following this set-up, we generate independent realisations ν = 1, 2, . . . , Ω of synthetic trial outcomes. Specifically, for each ν: (1) For all trials i, randomly sample the parameters Δ_{i,1ℓ}, ℓ = 2, . . . , m_i, from the multivariate normal distribution in equation (7). (2) Using the Δ_{i,1ℓ}, ℓ = 2, . . . , m_i, and a procedure still to be defined (see below), construct the probabilities p_{i,ℓ}, ℓ = 1, . . . , m_i, for all trials i in the network. (3) For each trial arm, generate random event data ('observations'), r_{i,ℓ}, from the binomial distribution in equation (1). In step (2) of the simulation procedure, the relative treatment effects Δ_{i,1ℓ} (ℓ = 2, . . . , m_i) in any one trial i do not uniquely define the absolute treatment effects p_{i,ℓ} (ℓ = 1, . . . , m_i) required for step (3). Equation (6) can be re-arranged to give logit(p_{i,ℓ}) = logit(p_{i,1}) + Δ_{i,1ℓ}, so that p_{i,1} together with the Δ_{i,1ℓ} (ℓ = 2, . . . , m_i) specifies all absolute treatment effects in trial i. To fully define step (2) in the above algorithm it is therefore sufficient to specify the construction of p_{i,1}. A discussion of possible data generating models for p_{i,1} is given in [40]. We briefly describe two possible methods, highlighting the resulting symmetry or asymmetry introduced.
One simple procedure involves sampling p i,1 for each trial from some specified distribution. For example one could choose a uniform distribution between two limits (perhaps zero and one to sample the full range of effects) or from a normal distribution truncated at zero at the lower end, and at one at the upper end. One then obtains the other absolute treatment effects via equation (10). By using a different method for generating the absolute effect of the trial-specific baseline (p i,1 ) compared to the other absolute treatment effects in the trial (p i, =1 ), this method introduces asymmetry into the generation procedure. A simple way of reducing the effect of this asymmetry on the synthetic data is to randomly select the trial-specific baseline treatments at each iteration ν.
An alternative data generation model was proposed by Seide et al [40]. Here, one chooses the absolute treatment effect for the baseline treatment to be the value that minimises the Euclidean distance of the vector (p_{i,1}, . . . , p_{i,m_i}) from the vector (1/2, . . . , 1/2), i.e.

p_{i,1} = argmin_{p ∈ (0,1)} Σ_{ℓ=1}^{m_i} ( p_{i,ℓ}[p, Δ_{i,1ℓ}] − 1/2 )^2,

where p_{i,ℓ}[·, ·] is the expression given in equation (10). This method maintains symmetry since all m_i absolute treatment effects in trial i are determined simultaneously.
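The generation steps above can be sketched in a few lines of Python. This is a simplified illustration with made-up parameter values: for brevity it draws the Δ_{i,1ℓ} independently rather than with the covariance τ^2/2 of equation (7), and it uses the uniform-baseline method for p_{i,1}:

```python
import math, random

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

def simulate_trial(d_true, tau, arms, n_per_arm, rng):
    """One synthetic trial. `arms` lists treatment labels (trial-specific
    baseline first); d_true[a] is the true effect of treatment a relative
    to the global baseline."""
    base = arms[0]
    # Step (1): trial-specific relative effects (simplified: independent draws)
    deltas = [rng.gauss(d_true[a] - d_true[base], tau) for a in arms[1:]]
    # Step (2): absolute baseline effect p_{i,1} sampled uniformly, then eq. (10)
    p1 = rng.uniform(0.1, 0.9)
    probs = [p1] + [expit(logit(p1) + d) for d in deltas]
    # Step (3): binomial event counts for each arm
    events = [sum(rng.random() < p for _ in range(n_per_arm)) for p in probs]
    return probs, events

rng = random.Random(42)
d_true = {1: 0.0, 2: 0.8, 3: -0.3}   # basic parameters (T1 is the global baseline)
probs, events = simulate_trial(d_true, tau=0.2, arms=[1, 2], n_per_arm=200, rng=rng)
print(len(events))   # 2: one event count per arm
```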

The process of carrying out a network meta-analysis: brief overview
The aim of NMA is to estimate the mean relative treatment effects d_ab for all pairs a ≠ b, and the heterogeneity parameter, τ. Given the transitivity relation in equation (8) not all d_ab are independent. As explained earlier, we can use treatment a = T_1 as the overall global baseline treatment. It is then sufficient to estimate d_{T_1 a} for a = T_2, . . . , T_N.

Frequentist versus Bayesian network meta-analysis.
In sections 3 and 4 below we describe the two main approaches to carrying out an NMA, i.e. the steps that are used to infer parameters (such as relative treatment effects) from the observed trial data. The two approaches correspond to the two main branches of inference, frequentist and Bayesian inference.
Much has been written about the difference between Bayesian and frequentist approaches to inference (e.g. [41][42][43][44][45]). One central point distinguishing the two is the conception of probability. Frequentist inference defines the probability of some event in terms of how frequently the event occurs if we repeat some process (e.g. an experiment) many times [41]. The Bayesian approach instead uses probability to describe the degree of belief in a statement [46,47]. In the Bayesian framework, parameters such as treatment effects are considered random variables, where the randomness reflects the remaining uncertainty after the inference process. A sharp distribution for a parameter indicates that we can be fairly certain that the inferred parameter is in a given range around the mode of that distribution. If the distribution is wide then the strength of our beliefs is weak. In the Bayesian approach probability therefore becomes subjective. It is not a property of the system only, but also of the prior beliefs, and the information available to the observer. Given the Bayesian interpretation of probability as a degree of belief, we can make statements such as 'given our prior beliefs and the data we have observed, we think that treatment A is more effective than treatment B with probability 70%'. In frequentist inference, probability is not an expression of beliefs. Therefore, the corresponding statement would be, for example, 'based on hypothesised repetitions of the experiment, treatment A would be estimated to be more effective than treatment B 70% of the time'.
A concept used in both Bayesian and frequentist inference is the 'likelihood function'. For observed data D, the likelihood function is the conditional probability (or probability density) of observing this data given a specific set of model parameters θ, P(D|θ). 7 The likelihood function is so named because it describes how likely it is to observe the given data for different values of parameters. In fact, the likelihood is viewed as a function of the parameters rather than the data and-somewhat confusingly-is often written as L(θ|D).
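A small numerical example may help. Below, the binomial log-likelihood for fixed data (r = 30 events out of n = 100) is evaluated at several candidate values of the parameter p; viewed as a function of p, it peaks near the observed proportion r/n:

```python
import math

def binom_log_likelihood(r, n, p):
    """log P(r events out of n | event probability p), binomial model."""
    return (math.lgamma(n + 1) - math.lgamma(r + 1) - math.lgamma(n - r + 1)
            + r * math.log(p) + (n - r) * math.log(1 - p))

# Fixed data, likelihood viewed as a function of the parameter p:
r, n = 30, 100
candidates = [0.1, 0.3, 0.5]
lls = {p: binom_log_likelihood(r, n, p) for p in candidates}

# The likelihood is maximised near the observed proportion r/n = 0.3
print(max(lls, key=lls.get))   # 0.3
```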

Arm-based versus contrast-based data.
In the setup so far we have treated the {r_{i,ℓ}} (the number of events in the arms of the trials) as the trial outcome or 'data' in an NMA. Since each of the r_{i,ℓ} is associated with an arm in a trial, data of this type is referred to as 'arm-level' data.
Trial outcomes can also be reported in the form of relative treatment effects. One example is the LOR between the effect of treatment ℓ in trial i and the baseline treatment in that trial,

y_{i,1ℓ} = ln[ r_{i,ℓ}/(n_{i,ℓ} − r_{i,ℓ}) ] − ln[ r_{i,1}/(n_{i,1} − r_{i,1}) ].

The quantity y_{i,1ℓ} is an estimate of Δ_{i,1ℓ} from the data in trial i only and is described as the 'observed' relative effect in that trial. The observed LOR is a so-called 'summary statistic' and data of this type is called 'summary-level data'. Sometimes only data of this type is reported by the trial (as opposed to, for example, detailed information about the r_{i,ℓ} in each arm).
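As an illustration, the observed LOR can be computed directly from arm-level counts. The sketch below also applies the common 0.5 continuity correction for zero cells (a widely used convention, not part of the model described above):

```python
import math

def observed_lor(r_base, n_base, r_arm, n_arm):
    """Observed log odds ratio of an arm versus the trial baseline, with a
    0.5 continuity correction when any cell of the 2x2 table is zero."""
    if 0 in (r_base, r_arm, n_base - r_base, n_arm - r_arm):
        r_base, r_arm = r_base + 0.5, r_arm + 0.5
        n_base, n_arm = n_base + 1.0, n_arm + 1.0
    return (math.log(r_arm / (n_arm - r_arm))
            - math.log(r_base / (n_base - r_base)))

# Example: 10/100 events on the baseline arm, 20/100 on the comparator arm
y = observed_lor(10, 100, 20, 100)
print(round(y, 3))   # 0.811
```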
In NMA we can choose to model data on the arm level or the summary level. We refer to these approaches as 'arm-based' (AB) and 'contrast-based' (CB) models respectively [22]. Arm-level data is modelled using the assumed distribution for the trial outcomes. In the case of binary outcomes, for example, this is the binomial distribution in equation (1). The likelihood function of the data is then also based on this distribution. This is sometimes referred to as the 'exact-likelihood' or 'AB-likelihood' [48] approach. In contrast-based models the observed relative effects in each trial are modelled as following a normal distribution. This is an approximation, and the approach is also referred to as the 'approximate-likelihood' or the 'CB-likelihood' [48] model. 8 Both frequentist and Bayesian inference methods can be used to evaluate both AB and CB models. In practice, frequentist models [38,49,50] are usually based on contrast-level summaries while Bayesian models [9,17,39] tend to use arm-level data [51]. Clearly, if trials only report summary-level data then we are restricted to CB models.
In the next section we summarise the general Bayesian approach to inference, and describe how this is applied in an arm-based NMA model. In section 4 we first describe the contrast-based NMA model and give an overview of the frequentist approach to inference. We then show how frequentist inference can be used to estimate relative treatment effects in a CB model.

Bayesian network meta-analysis
In this section we discuss the Bayesian approach to NMA. We use an arm-based model and treat the number of events in each arm of each trial (the r_{i,ℓ}) as the raw data in the model.

General approach
The process of Bayesian NMA converts prior beliefs on the distribution of model parameters into posterior distributions using the observed data. The approach is based on Bayes' theorem, which in its simplest form can be stated as P(A|B) = P(B|A)P(A)/P(B), where A and B are outcomes of a probabilistic experiment. Writing θ for the parameters, and D for the data, this becomes P(θ|D) = P(D|θ)P(θ)/P(D). We are interested in the distribution of parameters given the observed data. In this context, we notice that the term P(D) is not a function of θ, and so we can write P(θ|D) = const × P(D|θ)P(θ), where the constant on the right is to be determined from normalisation. This is the fundamental equation for Bayesian NMA (and any other type of Bayesian inference). The object P(θ) on the right is known as the prior distribution of parameters. It reflects our beliefs about what the parameters might be, before we have taken into account the data D. The expression on the left is the posterior distribution of the parameters; it represents our updated beliefs having observed and used the data D. The factor that connects the two is the conditional probability, or likelihood function, P(D|θ) (see section 2.6.1).
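A minimal numerical illustration of this updating rule, using a standard conjugate Beta-binomial example rather than the NMA model itself:

```python
# Posterior = (likelihood x prior) / normalisation. With a Beta(a, b) prior
# on an event probability and binomial data, the posterior is again a Beta
# distribution, so the update can be done by hand.

a, b = 1.0, 1.0    # Beta(1, 1) prior: uniform belief about an event probability
r, n = 30, 100     # observed data: r events out of n patients

# Conjugacy: Beta(a, b) prior x binomial likelihood -> Beta(a + r, b + n - r)
post_a, post_b = a + r, b + n - r
posterior_mean = post_a / (post_a + post_b)

print(round(posterior_mean, 4))   # 0.3039: pulled slightly towards the prior mean 0.5
```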

Hierarchical structure of the random effects model
Focusing again on the special case of binary outcomes, the NMA random effects model is given in equations (1), (6) and (7), and is illustrated in figure 3. In this context, the parameters θ are the true relative treatment effects d and the heterogeneity parameter τ . These are the 'parameters of interest' [41,52], and we will refer to them simply as the 'model parameters'. As described in section 2.4 there are two levels of randomness in the RE model. The first layer generates trial-specific relative treatment effects Δ i (i = 1, . . . , M) and absolute effects for the baseline treatment in each trial (the p i,1 in figure 3). In a second layer, binomial outcomes are then produced for each trial arm. The trial-specific effects Δ i and the p i,1 , are so-called 'nuisance parameters' [41,52]. For the discussion of the general Bayesian approach we will call these ν. They are random variables, and their distribution is parameterised by the model parameters θ.
The nuisance parameters in turn determine the distribution of the output data D. This is captured by the relation P(D|θ) = ∫ dν P_out(D|ν) P_in(ν|θ). We write P_in(ν|θ) to describe the internal layer of the model (generation of nuisance parameters from the model parameters), and P_out(D|ν) for the 'output layer' (generation of data from the nuisance parameters).
Using equation (14) we then have P(θ|D) = const × ∫ dν P_out(D|ν) P_in(ν|θ) P(θ), where P(θ) is the prior distribution of the parameters θ.
In section 3.3 below, we focus on the construction of U(D, ν, θ) ≡ P_out(D|ν) P_in(ν|θ) P(θ). This is the joint distribution of the model parameters θ, the nuisance parameters ν and the data D.
To obtain the posterior distribution P(θ|D) one fixes D to be the observed data. The next step then is to integrate out the nuisance parameters in equation (16). The normalisation constant in this equation can be determined at the end.
This reduces the problem of carrying out an NMA to two tasks: (a) We need to construct explicit forms for the factors on the right-hand side of equation (17); (b) We need a method with which to integrate out the nuisance parameters, and to extract the posterior distribution for θ. We will first discuss step (a) (for the example of binary outcomes). Numerical methods for step (b) are described in section 3.4.

Choice of priors for the model parameters.
The parameters of interest in the model are the heterogeneity τ , and the true treatment effects of treatments a ∈ {T 2 , . . . , T N } relative to the overall baseline treatment T 1 . The method requires distributions capturing prior beliefs on the values these parameters might take.
It is common to choose a Gaussian distribution as the prior for the relative treatment effects. The prior for τ has evoked more discussion [37,[53][54][55][56], but the usual practice is to use a uniform prior distribution between zero and some upper limit, τ max , which can depend on the data [17,36].
This results in the form

P(θ) = (1/τ_max) 1_{[0,τ_max]}(τ) × ∏_a (2πσ_d^2)^{−1/2} exp(−d_{T_1 a}^2 / (2σ_d^2)),

with the indicator function 1_{[x,y]}(τ) = 1 for x ≤ τ ≤ y, and 1_{[x,y]}(τ) = 0 otherwise. The product over a has N − 1 factors (one for each a ∈ {T_2, . . . , T_N}) and indicates that the prior distribution for each of the true relative treatment effects d_{T_1 a} is a Gaussian distribution with mean zero, and variance σ_d^2. In using this factorised form, we have assumed that these parameters are a priori pairwise independent [18,21].
It is common to use so-called 'non-informative' priors. These are relatively broad distributions for each of the parameters, reflecting a situation in which little information about the parameters is known a priori. This can be achieved by choosing values for τ_max and σ_d^2 that are large compared with the typical scale of the parameters τ and d respectively. What constitutes 'large' is informed by the type of data and the medical condition/clinical question of interest.
Occasionally it may be necessary to use an informative prior for the heterogeneity parameter [17,37,[57][58][59]. If the network contains very few trials per comparison then there is little information about the variation between trials and the estimation of τ is likely to be imprecise [59]. In particular, when a flat prior is used with data that gives little information about between-trial variance then the posterior of τ will be dominated by the prior which may lead to unrealistically high estimates of heterogeneity [17]. Informative priors have been proposed for τ based on external data, for example databases of existing meta-analyses that relate to the relevant data type, medical condition and interventions [57,58]. The use of such priors then allows us to incorporate external information about τ into the inference process.

Distribution of nuisance parameters for given model parameters.
The model parameters d = (d_{T_1 T_2}, . . . , d_{T_1 T_N}) and τ of the RE model determine the distribution of relative treatment effects Δ_i for each of the trials i = 1, . . . , M. As explained in more detail in section 2.4.2, the RE model assumes that each entry of Δ_i is drawn from a Gaussian distribution centred on the corresponding entry of X_i d, where X_i is the design matrix for trial i. The variance of each component of each Δ_i is τ^2. The covariance between any two different entries of Δ_i is assumed to be τ^2/2, and there are no correlations between Δ_i and Δ_j for two different trials i ≠ j.
Putting this all together we have

P_in(ν|θ) = ∏_{i=1}^{M} N(Δ_i; X_i d, Σ_i) P_bl,i(p_{i,1}),

where N(·; X_i d, Σ_i) denotes the multivariate Gaussian density with mean X_i d and covariance Σ_i, and where for a given trial i, the matrix Σ_i is of size (m_i − 1) × (m_i − 1), and has diagonal entries τ^2, and off-diagonal entries τ^2/2 (see section 2.4.2). The term P_bl,i(p_{i,1}) is the distribution for the absolute treatment effect associated with the trial-specific baseline.
It is here common to use a non-informative distribution, such as a normal distribution for logit(p i,1 ) with large variance [17,21].

Distribution of data for given nuisance parameters.
Finally, the data is given by the number of 'events' in each arm of each of the trials. The nuisance parameters p_{i,1} and Δ_{i,1ℓ} (ℓ = 2, . . . , m_i) for trial i translate into event probabilities for the treatments in this trial via equation (10), which we re-state here for convenience,

logit(p_{i,ℓ}) = logit(p_{i,1}) + Δ_{i,1ℓ}, ℓ = 2, . . . , m_i.

For binary outcomes the distribution of the data for given nuisance parameters is the binomial for each arm, with parameters n_{i,ℓ} and p_{i,ℓ}. Combining this for all arms of all trials in the network we arrive at

P_out(D|ν) = ∏_{i=1}^{M} ∏_{ℓ=1}^{m_i} (n_{i,ℓ} choose r_{i,ℓ}) p_{i,ℓ}^{r_{i,ℓ}} (1 − p_{i,ℓ})^{n_{i,ℓ} − r_{i,ℓ}}.
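Putting the pieces of this subsection together, the sketch below (plain Python, illustrative numbers) maps the nuisance parameters of one trial to arm-level event probabilities via the logit link and evaluates that trial's contribution to log P_out(D|ν):

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def log_binom(r, n, p):
    """log of the binomial probability mass function."""
    return (math.lgamma(n + 1) - math.lgamma(r + 1) - math.lgamma(n - r + 1)
            + r * math.log(p) + (n - r) * math.log(1 - p))

def trial_log_likelihood(p1, deltas, r, n):
    """One trial's contribution to log P_out(D | nu): the nuisance parameters
    (p_{i,1} and the Delta_{i,1l}) are mapped to event probabilities via the
    logit link, then scored against the observed counts r_l out of n_l."""
    theta1 = math.log(p1 / (1 - p1))                  # logit of the baseline effect
    probs = [p1] + [expit(theta1 + d) for d in deltas]
    return sum(log_binom(rl, nl, pl) for rl, nl, pl in zip(r, n, probs))

# Three-arm trial: baseline plus two comparators (made-up counts)
ll = trial_log_likelihood(p1=0.2, deltas=[0.5, -0.3],
                          r=[21, 30, 15], n=[100, 100, 100])
print(ll < 0.0)   # True: a log-probability of discrete data is negative
```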

Computational techniques
The posterior distribution for the model parameters is obtained from P(θ|D) = const × ∫ dν U(D, ν, θ), where U is defined in equation (17). Even though U is known in closed form, the integral over ν cannot be performed analytically. Due to the high dimensionality of the integral, direct numerical integration is not always viable either. One therefore resorts to computational methods to sample combined values for ν and θ, and then considers the resulting marginal distribution for θ. The most common techniques to do this in meta-analysis are Markov chain Monte Carlo (MCMC) methods. Popular MCMC software packages include WinBUGS [13], JAGS [60] and Stan [61].
MCMC methods are a class of algorithms based on the construction of a Markov chain whose stationary distribution is the target distribution. By recording the states of the chain after a sufficiently large number of steps, one obtains samples from the target distribution.
The MCMC implemented in the WinBUGS software combines the celebrated Metropolis-Hastings algorithm with Gibbs sampling, resulting in what is called 'Metropolis-in-Gibbs sampling'. We here briefly outline the main principles.

Metropolis-Hastings algorithm.
We discuss this topic in a more general sense (independent of NMA) and write the distribution from which we would like to sample as p(x). The Metropolis-Hastings algorithm [62] is based on a Markov chain producing a trajectory x_t, t = 0, 1, 2, . . . . Each step consists of proposing a value for x_{t+1}, followed by a decision whether to accept or reject this proposed value. The distribution of proposed values and the acceptance criterion only depend on x_t, but not on states visited earlier in the sequence. They are constructed such that the resulting process has stationary distribution p(x) [63]. Provided one allows for a sufficiently long equilibration time t_eq, the set {x_t : t > t_eq} represents a statistically faithful sample of the distribution p(x).
The algorithm results in an overall probability A(x|y) = q(x|y) p_a(x|y) to transition to x if the chain is currently at y. Using the fact that at least one of p_a(x|y) and p_a(y|x) is equal to one for each pair of states x and y, one has p(x)A(y|x) = p(y)A(x|y) for all x and y, i.e. the detailed balance condition holds. This is sufficient to demonstrate that p(x) is indeed a stationary distribution of the process. We also need to choose the hopping kernel q such that the stationary distribution is unique (i.e. the Markov chain must be irreducible and aperiodic). A sufficient condition to ensure this is that q is positive everywhere [64]. For further details see also [65,66].
The process simplifies if the proposal function (the hopping kernel) is symmetric, q(x|x_t) = q(x_t|x). In this case the acceptance probability in step (c) of the algorithm becomes p_a(x|x_t) = min{1, p(x)/p(x_t)}. An example of a symmetric proposal function q(x|x_t) in a univariate system (x_t ∈ ℝ) is a normal distribution centred on the current sample value x_t with a fixed variance [67]. The choice of variance is not straightforward, as illustrated in figure 4. Choosing a value that is too high means that most proposed values are rejected and the sequence of {x_t} remains constant for long periods of time. This scenario is shown in figure 4(b). On the other hand, if the variance is too small the chain does not explore the state space and convergence to the stationary state is slow (see figure 4(a)) [68]. Both of these scenarios make the MCMC less efficient and mean that more iterations are required for the chain to reach the stationary state. (Figure 4(c) shows a chain with a proposal standard deviation of 2.5 and an acceptance rate of 0.43; the optimal acceptance rate for a model in one dimension is approximately 0.44 [69].) If the proposal variance is tuned so that the chain has a 'reasonable' acceptance rate then the state space is explored efficiently. A chain with this characteristic is shown in figure 4(c). For an n-dimensional target distribution the optimal acceptance rate has been found to be approximately 0.44 for n = 1, declining to 0.23 as n → ∞ [68,69]. In the WinBUGS software, the acceptance rate of proposed values is tuned to between 0.2 and 0.4 [13].
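A random-walk Metropolis sampler with a symmetric Gaussian proposal can be written in a few lines. The sketch below targets a standard normal distribution (an illustrative choice) and reports the acceptance rate, which can be compared with the values quoted above:

```python
import math, random

def metropolis(logp, x0, n_steps, proposal_sd, rng):
    """Random-walk Metropolis: symmetric Gaussian proposal, accept with
    probability min(1, p(x') / p(x_t)); returns the chain and acceptance rate."""
    x, chain, accepted = x0, [], 0
    for _ in range(n_steps):
        x_prop = rng.gauss(x, proposal_sd)
        # acceptance ratio for a symmetric proposal, computed on the log scale
        if rng.random() < math.exp(min(0.0, logp(x_prop) - logp(x))):
            x, accepted = x_prop, accepted + 1
        chain.append(x)
    return chain, accepted / n_steps

rng = random.Random(0)
log_std_normal = lambda x: -0.5 * x * x   # standard normal target, up to a constant
chain, acc = metropolis(log_std_normal, x0=0.0, n_steps=20000,
                        proposal_sd=2.5, rng=rng)
print(round(acc, 2))   # acceptance rate; expected near 0.43 for this proposal scale
```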

Gibbs sampling.
Gibbs samplers are used to sample from multivariate distributions. They update one variable at a time, and are used when sampling from conditional probabilities of individual variables is easier than direct sampling from the multivariate distribution. The algorithm cycles through the individual variables, and samples from the conditional distribution of one variable in the target distribution given the current values of all other variables [67,70]. This can be shown to generate a sequence of multivariate samples faithfully representing the joint distribution [70].
Here we describe details of the Gibbs sampling procedure for a target distribution p(x) of n variables, where x = (x_1, x_2, . . . , x_n). The conditional probability distribution for the variable x_i given all other variables is p(x_i | x_{−i}) = p(x) / ∫ dx_i p(x). We have written x_{−i} = (x_1, . . . , x_{i−1}, x_{i+1}, . . . , x_n) for the collection of all variables other than x_i. In NMA we are interested in samples from the distribution U(D, ν, θ) = P_out(D|ν)P_in(ν|θ)P(θ) in equation (17), for a fixed D. To generate these samples we use the 'Metropolis-in-Gibbs' algorithm. This is a Gibbs sampling algorithm with a Metropolis-Hastings accept/reject step used to sample from the conditional distributions for the individual variables, i.e. we use the Metropolis-Hastings algorithm at each single-variable stage of the Gibbs sampling algorithm. (In the box below we write v_Δ, v_b, v_τ and v_d for the variances of the hopping kernels for the different variables.) We note that the acceptance probabilities p_a in each step of the algorithm are of the form p_a = min{1, P(parameter* | other parameters, D) / P(parameter_t | other parameters, D)}, where parameter* is the proposed value for the model or nuisance parameter that is being updated, and parameter_t its value in the previous iteration. Crucially, the 'other parameters' and the data in the numerator and denominator in equation (34) are identical, so the normalisation constants of the two conditional probabilities cancel. This means that we can use joint probabilities (or probability densities) instead of conditional probabilities.
Using the product form of the distribution U in equation (17) and the fact that P in , P out and the prior further factorise, the ratios in equation (35) can be simplified even more by cancelling factors that do not depend on the parameter that is being updated.
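The scheme can be sketched as follows. The toy target here is a correlated bivariate normal standing in for the joint distribution U(D, ν, θ), and the step sizes play the role of the hopping-kernel variances:

```python
import math, random

def metropolis_in_gibbs(logp_joint, x0, n_steps, step_sds, rng):
    """Sweep over coordinates; each is updated with a single-variable
    Metropolis step evaluated on the JOINT log-density (the conditional is
    proportional to the joint, so the normalising factors cancel)."""
    x = list(x0)
    chain = []
    for _ in range(n_steps):
        for k in range(len(x)):
            proposal = x[:]
            proposal[k] = rng.gauss(x[k], step_sds[k])
            if rng.random() < math.exp(min(0.0, logp_joint(proposal) - logp_joint(x))):
                x = proposal
        chain.append(x[:])
    return chain

# Toy target: bivariate normal with correlation rho (stands in for U(D, nu, theta))
rho = 0.5
def logp(v):
    x, y = v
    return -(x * x - 2.0 * rho * x * y + y * y) / (2.0 * (1.0 - rho * rho))

rng = random.Random(1)
chain = metropolis_in_gibbs(logp, x0=[0.0, 0.0], n_steps=20000,
                            step_sds=[1.0, 1.0], rng=rng)
mean_x = sum(v[0] for v in chain) / len(chain)
print(round(mean_x, 2))   # should be close to 0, the true mean
```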

Assessing convergence.
The MCMC dynamics define a stochastic process with a stationary probability distribution for the model parameters. The process is constructed such that this stationary distribution is the target distribution we set out to sample from. Formally, the stationary distribution is reached only at infinite time, but in practice samples are effectively drawn from the target distribution after sufficiently many iterations. In simulations, we therefore discard the samples of the first n_c iterations (this is referred to as the 'burn-in' in the statistics community; physicists know this as the equilibration time or transient). We then make inferences about our parameters based on samples taken in the subsequent iterations.
To obtain accurate inferences on the parameter values we must, therefore, assess the number of iterations required to reach stationarity. A common method to assess this was developed by Gelman and Rubin [71] and later modified by Brooks and Gelman [72]. The latter article gives a detailed description of the approach, we summarise the main ideas in appendix B.
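The basic Gelman-Rubin statistic (without the later Brooks-Gelman refinements) is straightforward to compute from a set of parallel chains; a value close to 1 suggests the chains have reached stationarity:

```python
import random, statistics

def gelman_rubin(chains):
    """Potential scale reduction factor (basic Gelman-Rubin R-hat) for a list
    of equal-length chains; values close to 1 indicate approximate convergence."""
    n = len(chains[0])
    means = [statistics.fmean(c) for c in chains]
    w = statistics.fmean(statistics.variance(c) for c in chains)  # within-chain
    b_over_n = statistics.variance(means)                         # between-chain / n
    var_hat = (n - 1) / n * w + b_over_n   # pooled variance estimate
    return (var_hat / w) ** 0.5

rng = random.Random(2)
# Four chains drawn from the same distribution: R-hat should be close to 1
chains = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]
rhat = gelman_rubin(chains)
print(round(rhat, 2))   # approximately 1.0 for well-mixed chains
```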
Once the Markov chain has reached stationarity, we make inferences about the model parameters from the samples taken in the simulations. Usually this means calculating a central value of the distribution of samples (such as the mean or median) and some measure of spread (such as standard deviation). We discuss how exactly the results of the NMA are reported in section 5.

Software and example code
Historically, Bayesian NMA has most frequently been implemented using the MCMC software WinBUGS [13]. The MCMC methods described in this article are the ones implemented in this software. Examples of WinBUGS source code for NMA are widely available (see e.g. [73][74][75]). Most notably, WinBUGS code accompanied the seminal 2004 paper by Lu and Ades [39] in which the first Bayesian NMA models were introduced. In addition, the NICE (National Institute for Health and Care Excellence) Decision Support Unit's Technical Support Document 2 [17] provides detailed WinBUGS codes for a number of NMA models with various types of outcome data.
More recently, other software packages for Bayesian MCMC (and Bayesian NMA specifically) have been developed. This includes PyMC [76] (a Python package), Open-BUGS [77] (an open source version of WinBUGS), JAGS [60] (similar to WinBUGS and OpenBUGS), gemtc [78] (an R package specific to NMA which requires JAGS) and Stan [61] (an open source software written in C++ that interfaces with languages such as Python, R and MATLAB). Stan appears to be emerging as an increasingly popular alternative to WinBUGS [79]. Examples of Stan code for NMA can be found at the URL [80].
In connection with previous work [81] we wrote our own C++ code to implement the Bayesian methods described in this review. This earlier work constitutes a simulation study focusing on the effects of network structure on NMA outcomes. The code was designed specifically for the purposes of this earlier work, and is available directly from [81].

Frequentist network meta-analysis
We now move on to the frequentist approach to NMA. In this section we use the contrast-based variant of NMA and treat the relative treatment effects measured in each trial as the raw data in the model. For binomial outcomes in each trial arm, these are the observed log odds ratios (LORs) y_{i,1ℓ}. The y_{i,1ℓ} are continuous (relative) outcomes obtained from the binomial (absolute) outcomes. This section is therefore not restricted to LORs, but instead holds for any continuous relative effects (making the assumption that these effects are normally distributed).
This section is structured as follows: in section 4.1 we introduce the contrast-based NMA model and write it as a linear regression problem. Then, in section 4.2 we discuss the frequentist approach in more general terms and describe how we estimate regression coefficients in a linear regression model. We start by explaining the ordinary least squares (OLS) method and then use this to derive the generalised least squares (GLS) problem. We then explain the maximum likelihood (ML) approach and show that this leads to the same condition as GLS for continuous outcomes under the normality assumption. We solve the GLS/ML problem to obtain the so-called 'Aitken estimator' of the regression coefficients. In section 4.3 we go back to the NMA model. First, we use the Aitken estimator to estimate the relative treatment effects d (section 4.3.1). Finally, we explain two common methods for estimating the heterogeneity variance τ^2 (section 4.3.2).

Introduction and notation
We begin by outlining the assumptions of the contrast-based NMA model and writing the inference task as a linear regression problem.
As in previous sections, the vector of relative treatment effects in each trial is assumed to follow a normal distribution. We recall equation (7), which we can write as

Δ_{i,1ℓ} = d_{t_{i,1} t_{i,ℓ}} + η_{i,1ℓ},

where i = 1, . . . , M, ℓ = 2, . . . , m_i and η_{i,1ℓ} is a Gaussian random variable of mean zero. Using transitivity (equation (8)) the mean d_{t_{i,1} t_{i,ℓ}} can be constructed via a linear relation from the vector of basic parameters d = (d_{T_1 T_2}, . . . , d_{T_1 T_N})^⊤ (cf equation (9)). The variance of η_{i,1ℓ} is given by the heterogeneity τ^2, and we note that for a fixed i the different η_{i,1ℓ}, ℓ = 2, . . . , m_i, will in general be correlated (see equation (7)). We can collect the relations in equation (36) for all trials i and all basic comparisons within each trial, and write more compactly

Δ = X d + η.

Here, X is the design matrix of the network, which can be constructed by stacking the trial-specific design matrices described in section 2.4.2, X = (X_1^⊤, . . . , X_M^⊤)^⊤.
The matrix X has N − 1 columns and M i=1 (m i − 1) rows, and each entry is either −1, 0 or 1 (see appendix A). Each column of X represents one of the treatments T 2 , . . . , T N (treatment T 1 is the overall baseline). The rows represent comparisons to the trialspecific baseline in each study.
As before, we write y_{i,1ℓ} for the observed relative effects in each trial (e.g. the LORs in equation (12)). These are assumed to follow a normal distribution centred on the mean value Δ_{i,1ℓ} with some random sampling error ε_{i,1ℓ} (with variance σ_{i,1ℓ}^2). That is,

y_{i,1ℓ} = Δ_{i,1ℓ} + ε_{i,1ℓ},

where the sampling errors ε_{i,1ℓ} within a trial are correlated. Trial i ∈ {1, . . . , M} compares m_i treatments and therefore contributes m_i − 1 relative treatment effects (comparisons of treatments ℓ = 2, . . . , m_i to the trial-specific baseline treatment).
Collecting the Σ_{i=1}^{M} (m_i − 1) observations y_{i,1ℓ} in the vector y, we can write the linear model as

y = Xd + η + ε.   (39)

The vectors η and ε represent the two levels of stochasticity in the RE model described in section 2.4; η ∼ N(0, Σ) models the trial-to-trial variation of relative treatment effects, and ε ∼ N(0, V) models the noise on the observed relative treatment effects resulting from the sampling in the trial arms. We stress again that we are working within a contrast-based model, where the sampling noise in the trial arms is assumed to be Gaussian. The covariance matrices for the two types of stochasticity are Σ and V respectively; we will discuss their mathematical form below. The two types of noise are independent of each other, and the overall covariance matrix is then C = Σ + V, such that the model in equation (39) can be written as

y ∼ N(Xd, C).   (40)

The covariance matrix associated with the sampling errors in the trials is of block diagonal form, V = diag(V_1, …, V_M), where V_i is the (m_i − 1) × (m_i − 1) covariance matrix of the sampling errors in trial i. We stress that this describes sampling errors only, i.e. the matrix entries are the variances of the components ε_{i,1ℓ}, ℓ = 2, …, m_i, and the covariances between these variables. The measurements of relative treatment effects within a multi-arm trial are correlated because they involve a common treatment arm (the trial-specific baseline treatment). The values that make up the matrices V_i are assumed to be known (i.e. they are reported in the study, or can be directly calculated from the data; see appendix C for details). Further details can also be found in [5,7,17,28,82,83]. The covariance matrix Σ associated with the random effects represents the heterogeneity between trials. Similarly to V, it has block diagonal form, Σ = diag(Σ_i), where the blocks are the (m_i − 1) × (m_i − 1) matrices Σ_i defined in section 2.4.2. The diagonal elements of Σ_i are the variances associated with the random effects and the off-diagonal elements relate to the correlations between the random effects within a multi-arm trial.
We assume that these are determined by the unknown heterogeneity variance, τ², as described in section 2.4.2. Determining τ² is therefore part of the inference problem. To understand this model, consider the example shown in figure 5 and described in the box below.
The aim of NMA is to estimate the unknown parameters d and τ². In the language of regression models, d are the 'regression coefficients' of the linear model and τ² is the 'variance parameter'. In frequentist NMA, the variance parameter is estimated first; this estimate is then used when estimating the regression coefficients. In the following we describe two frequentist approaches for estimating the regression coefficients assuming knowledge of the variance parameter. We then relate this to the contrast-based NMA model. Finally, we describe some common methods for estimating the variance parameter.

General frequentist approach
Two frequentist approaches to inferring the regression coefficients of a model are based on maximum likelihood (ML) and on least squares.
In the ML approach one finds the values of the parameters that maximise the likelihood of observing the given data, or, equivalently, that minimise the negative log likelihood. In least squares regression, we start from equation (39). The vector y is observed from the trials, and the matrix X is known from the design of the trials (what treatments are tested in each trial). We therefore wish to find the vector d that best fits the observed data via equation (39). To do this a 'residual' is defined as the difference between the observed value of the response variable (here y ) and the mean value Xd predicted by the regression model. The model parameters (here d ) are then estimated by minimising the sum of the squared residuals, i.e. we find the values of the model parameters that 'best fit' the data. When measurements are associated with correlated random errors we must use so-called 'generalised least squares' (GLS) regression [84]. This will be explained in more detail later. For a linear regression model of continuous outcomes under the assumption of normally distributed errors, the ML estimates and GLS estimates are equivalent [85,86] and can be found analytically.
We now derive these estimates using the GLS procedure and show that this is equivalent to obtaining the ML estimates.

Ordinary least squares problem.
We first describe this in general terms, and discuss the application to NMA further below. Assume we observe data {y i , x i,j } on n statistical units such that i = 1, . . . , n and j = 1, . . . , p. In the context of an NMA these would be the relative treatment effects and the design matrix elements. The latter are treated as part of the 'data' for the discussion in this section, as they are specific to individual instances of a real-world NMA. More generally, X may contain a set of observed covariates.
The values of the response variable are collected in the vector y = (y_1, y_2, …, y_n)^T. The predictor variables are placed in the n × p design matrix X, with entries x_{i,j}, and X is assumed to have full rank (in NMA this is true by construction). We consider the linear regression model

y = Xβ + ε,   (42)

where β is a column vector of length p containing the parameters that we wish to estimate. The error term ε is assumed to be normally distributed. In the simplest case we assume uncorrelated errors and equal variances, so that the covariance matrix for the components of ε is a multiple of the identity matrix. This assumption is known as the OLS condition. The vector of residuals, y − Xβ, represents the difference between the observed outputs, y, and the mean values predicted by the model in equation (42). The OLS estimates of the parameters are then obtained by minimising the sum of the squared residuals,

β̂ = argmin_β (y − Xβ)^T (y − Xβ).   (43)

(Throughout the paper we denote estimates of model parameters by a hat.)
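To make the OLS recipe concrete, here is a short numerical sketch. The data are invented and noise-free, so solving the normal equations X^T X β = X^T y recovers the true coefficients exactly.

```python
import numpy as np

# illustrative (invented) data: n = 4 observations, p = 2 predictors,
# here an intercept column and one covariate
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
beta_true = np.array([1.0, 2.0])
y = X @ beta_true  # noise-free responses, so OLS recovers beta exactly

# OLS estimate: minimise (y - X b)^T (y - X b)
# by solving the normal equations X^T X b = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

With noisy data the estimate would of course scatter around the true coefficients rather than match them exactly.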

Generalised least squares problem.
In GLS regression we relax the OLS assumption. That is, we make no assumptions on the form of the covariance matrix of the components of ε. We will use the notation C for this covariance matrix, i.e. ε ∼ N(0, C).
In the case of correlations or non-equal variances, equation (43) generalises to

β̂ = argmin_β (y − Xβ)^T C^{−1} (y − Xβ).   (44)

Further details can be found in appendix D. If errors are uncorrelated but do not necessarily have equal variances then C is a diagonal matrix. Regression under these conditions is a special case of GLS known as 'weighted least squares'. The expression in equation (44) can be minimised analytically, as we will see in section 4.2.4. Before we return to this we derive the ML estimator and show that it leads to the same condition as in equation (44).

Maximum likelihood approach.
The linear model in equation (42) with normally distributed errors ε ∼ N(0, C) can be written equivalently as

y ∼ N(Xβ, C),   (46)

where we place no assumptions on the covariance matrix C except that it depends on a set of variance parameters φ. The likelihood of this model is then simply the multivariate normal distribution with mean vector Xβ and covariance matrix C, and the log likelihood is

ln L(β, φ | y, X) = −½ ln(det C) − ½ (y − Xβ)^T C^{−1} (y − Xβ) + const.   (47)

Treating the variance parameters as known, we infer the regression coefficients β by maximising the (log) likelihood with respect to β. This is equivalent to minimising the term

(y − Xβ)^T C^{−1} (y − Xβ),   (48)

which is identical to the GLS problem in equation (44).

Solution to the generalised least-squares and maximum-likelihood problem (the Aitken estimator)
We now solve equation (44) (= equation (48)) to find the estimator β̂ = β̂_GLS = β̂_ML. We first multiply out the product (y − Xβ)^T C^{−1} (y − Xβ), and then set the partial derivative with respect to β equal to zero. This leads to

0 = ∂/∂β [ y^T C^{−1} y − y^T C^{−1} Xβ − β^T X^T C^{−1} y + β^T X^T C^{−1} Xβ ].   (49)

We address this term by term. The first term, y^T C^{−1} y, is independent of β and therefore differentiates to zero. The second and third terms take the forms a^T β and β^T a respectively, where a = −X^T C^{−1} y is a column vector of length p. We have ∂(a^T β)/∂β = ∂(β^T a)/∂β = a, so the derivatives of the second and third terms of equation (49) each evaluate to −X^T C^{−1} y. The last term on the right-hand side of equation (49) is quadratic in β. We also note that the matrix X^T C^{−1} X is symmetric. Therefore the derivative of the final term in equation (49) evaluates to 2X^T C^{−1} Xβ. Combining these results, equation (49) reduces to

X^T C^{−1} Xβ = X^T C^{−1} y.   (50)

Solving for β yields the GLS and ML estimator of the vector of regression coefficients, also known as the 'Aitken estimator' [87],

β̂ = (X^T C^{−1} X)^{−1} X^T C^{−1} y.   (51)
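As a numerical check of the Aitken estimator, the following sketch uses an invented design matrix, coefficient vector and (assumed-known) covariance matrix; the data are noise-free, so the estimator returns β exactly.

```python
import numpy as np

# toy correlated-error model: y = X beta + eps, eps ~ N(0, C)
# (all numbers invented for illustration)
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
beta = np.array([0.5, -1.0])
C = np.array([[1.0, 0.3, 0.0],
              [0.3, 1.0, 0.3],
              [0.0, 0.3, 1.0]])  # assumed-known error covariance

Cinv = np.linalg.inv(C)
y = X @ beta  # noise-free check: the estimator should recover beta exactly

# Aitken estimator: beta_hat = (X^T C^-1 X)^-1 X^T C^-1 y
A = X.T @ Cinv @ X
beta_hat = np.linalg.solve(A, X.T @ Cinv @ y)

# covariance of the estimator: (X^T C^-1 X)^-1
cov_beta_hat = np.linalg.inv(A)
```

Solving the linear system with `np.linalg.solve` rather than explicitly inverting X^T C^{−1} X is the numerically preferred route in practice.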
Recalling that E(y) = Xβ (see equation (42)), the expectation of this estimate is

E(β̂) = (X^T C^{−1} X)^{−1} X^T C^{−1} E(y) = (X^T C^{−1} X)^{−1} X^T C^{−1} Xβ = β,   (52)

indicating that the Aitken estimator is an unbiased estimate of β.
To find the (p × p) covariance matrix of this estimate we make use of the result Cov(Az) = A Cov(z) A^T and find

Cov(β̂) = (X^T C^{−1} X)^{−1} X^T C^{−1} Cov(y) C^{−1} X (X^T C^{−1} X)^{−1} = (X^T C^{−1} X)^{−1},   (53)

where we have used Cov(y) = C and the fact that the matrices C and X^T C^{−1} X are symmetric.

Frequentist inference for network meta-analysis
In section 4.2 we derived the GLS/ML estimator of the regression coefficients for a linear regression model (equation (51)). We now use this result to estimate the relative treatment effects d in the NMA model from section 4.1 (equation (39)). Following this, we discuss frequentist methods for estimating the heterogeneity variance τ².

4.3.1. Estimating the mean relative treatment effects. We start from the linear regression model for a RE NMA,

y = Xd + η + ε,   (54)

with η ∼ N(0, Σ) and ε ∼ N(0, V) as in equation (39). The within-study covariance matrix V (describing the statistics of sampling noise) is assumed to be known, whereas the between-study covariance matrix Σ depends on the unknown heterogeneity variance τ².
Assuming an estimate τ̂² of the heterogeneity variance (and therefore an estimate Σ̂ of the covariance matrix), we find the mean relative treatment effects d via the Aitken estimator in equation (51),

d̂_RE = (X^T (V + Σ̂)^{−1} X)^{−1} X^T (V + Σ̂)^{−1} y,   (55)

where we have labelled this explicitly as a random effects (RE) estimate, since the estimator depends on the heterogeneity τ² (or an estimate τ̂²). It is useful to define the inverse-variance weight matrix W = (V + Σ̂)^{−1}. We can then write

d̂_RE = (X^T W X)^{−1} X^T W y.   (56)

Using equation (53), the covariance matrix associated with this estimator is given by

Cov(d̂_RE) = (X^T W X)^{−1}.   (57)

To obtain the estimator for a fixed effects (FE) model, d̂_FE, we simply set τ² = 0. We then have Σ = 0 and hence W = V^{−1}.
Special case: pairwise meta-analysis. In a pairwise meta-analysis of N = 2 treatments, each trial provides an estimate y_i of the same relative treatment effect (we write d for its true value). The design matrix X is then an M × 1 matrix, and all entries are equal to one. The covariance matrix C is an M × M diagonal matrix with elements equal to σ_i² + τ², and the within-study variances σ_i² are assumed known. The RE model is then

y_i = d + η_i + ε_i,   i = 1, …, M.

The weight matrix is also diagonal and, for a given estimate of heterogeneity τ̂², its elements are equal to w_i = (σ_i² + τ̂²)^{−1}. We then find that the Aitken estimator of the relative treatment effect reduces to

d̂_RE = (Σ_i w_i y_i) / (Σ_i w_i).   (58)

The variance of this estimate is

Var(d̂_RE) = 1 / Σ_i w_i.   (59)

Therefore, for pairwise meta-analysis, the GLS and ML approaches recover the results for a simple weighted mean of the sample. Again, for a fixed effect model, d̂_FE is obtained by setting τ² = 0, that is, d̂_FE = (Σ_i σ_i^{−2} y_i) / (Σ_i σ_i^{−2}).
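The pairwise estimator in equation (58) is just an inverse-variance weighted mean. A short sketch, using invented trial data and an assumed heterogeneity estimate:

```python
# pairwise RE meta-analysis as an inverse-variance weighted mean (a sketch);
# the per-trial effects y_i, within-study variances s2_i and the
# heterogeneity estimate tau2 are all invented for illustration
y = [0.2, -0.1, 0.4]
s2 = [0.04, 0.09, 0.16]
tau2 = 0.02  # assumed estimate of the heterogeneity variance

w = [1.0 / (s + tau2) for s in s2]            # weights (sigma_i^2 + tau^2)^-1
d_re = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
var_re = 1.0 / sum(w)                          # variance of the RE estimate

# fixed-effect version: set tau2 = 0, i.e. weights sigma_i^-2
w_fe = [1.0 / s for s in s2]
d_fe = sum(wi * yi for wi, yi in zip(w_fe, y)) / sum(w_fe)
```

Note that the RE weights are flatter than the FE weights (the common τ² term downweights precise trials relatively less), so the RE estimate sits closer to an unweighted mean.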
4.3.2. Estimating the heterogeneity variance. The most widely used methods for estimating τ² fall into two categories: (i) the method of moments [101], and (ii) restricted maximum likelihood (REML) approaches [102]. The former involves defining a measure of heterogeneity based on the sum of squared residuals. The latter involves modifying the likelihood function of the random effects model to remove its dependence on the relative treatment effects and then maximising this modified likelihood with respect to τ². We describe both approaches in turn.

Method of moments.
We first describe this for pairwise meta-analysis [88,101]. Inference of the regression coefficients in pairwise meta-analysis involves a weighted mean of the sample (equation (58)). For general weights a_i (i = 1, …, M) associated with observations y_i we define the weighted mean

ŷ = (Σ_i a_i y_i) / (Σ_i a_i).   (60)

If we set a_i = (σ_i² + τ²)^{−1} then we have ŷ = d̂_RE, i.e. we recover the random effects estimate in equation (58). Other choices of the weights will be discussed below.
The residuals associated with this weighted mean are given by y_i − ŷ. The so-called 'Q statistic' [5,88,93] is then defined as the weighted sum of squared residuals,

Q = Σ_i a_i (y_i − ŷ)².   (61)

The estimate of τ² is obtained by assuming that the empirical value of Q obtained via equation (61) from the observed data is equal to its expectation under the random effects model. In appendix E.1 we show that this leads to the general method of moments estimator,

τ̂² = [ Q − ( Σ_i a_i σ_i² − (Σ_i a_i² σ_i²)/(Σ_i a_i) ) ] / [ Σ_i a_i − (Σ_i a_i²)/(Σ_i a_i) ].   (62)

In practice, the expression on the right-hand side of equation (62) can come out negative. In this case one sets τ̂² = 0. Different choices can be made for the weights a_i in equations (60) and (61). For example, the widely used DerSimonian and Laird (DL) estimator [5] uses the fixed effect weights, a_i = σ_i^{−2}, so that ŷ = d̂_FE. Cochran's ANOVA (CA) estimator [103] uses equal weights a_i = 1/M, while the Paule–Mandel (PM) estimator [92] uses the random effects weights a_i = (σ_i² + τ²)^{−1}. The DL and CA estimators lead to a closed form solution for the estimate of τ² (in the sense that the right-hand side of equation (62) becomes independent of τ²). This is not the case for the PM estimator since these weights depend on τ². Equation (62) must then be solved numerically.
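A sketch of the DL special case for invented pairwise data. With a_i = σ_i^{−2}, equation (62) reduces to the familiar closed form τ̂² = max{0, [Q − (M − 1)]/(S₁ − S₂/S₁)}, where S₁ = Σ a_i and S₂ = Σ a_i².

```python
# DerSimonian-Laird estimator of tau^2 in pairwise MA (a sketch;
# the effects y and within-study variances s2 are invented)
y = [0.2, -0.4, 0.6, 1.2]
s2 = [0.04, 0.09, 0.16, 0.25]

a = [1.0 / s for s in s2]                     # fixed-effect weights sigma_i^-2
y_fe = sum(ai * yi for ai, yi in zip(a, y)) / sum(a)

# Q statistic: weighted sum of squared residuals about the FE mean
Q = sum(ai * (yi - y_fe) ** 2 for ai, yi in zip(a, y))
S1 = sum(a)
S2 = sum(ai ** 2 for ai in a)

# equate Q with its expectation under the RE model and solve for tau^2;
# truncate at zero if the estimate comes out negative
tau2_dl = max(0.0, (Q - (len(y) - 1)) / (S1 - S2 / S1))
```

For this toy data set Q clearly exceeds its FE expectation M − 1 = 3, so the estimator returns a strictly positive heterogeneity variance.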
Extending the DL estimator to the case of NMA is straightforward [97,99,100]. We generalise equation (61) using the inverse-variance weight matrix and obtain

Q = (y − ŷ)^T V^{−1} (y − ŷ).   (63)

We recall that V^{−1} represents the observed within-study variances and correlations, so the expression in equation (63) is analogous to using a_i = σ_i^{−2} in the pairwise case. The vector ŷ is the set of network estimates of y obtained using the fixed effects weights V^{−1}. That is,

ŷ = X d̂_FE,   (64)

where d̂_FE is the vector of estimates of the mean relative treatment effects under the fixed effect model, obtained by setting W = V^{−1} (i.e. Σ = 0) in equation (56). Equations (63) and (64) are the NMA analogue of equations (61) and (60) in pairwise MA (when a_i = σ_i^{−2}). Similar to the pairwise case, we equate the empirically observed value of Q in equation (63) with its expectation under the RE model. As shown in appendix E.2, this leads to the estimator in equation (65), in which the matrix P is defined via Σ = τ²P and the matrix A is defined in equation (E.13) in appendix E.2. Equation (65) is the DerSimonian and Laird estimator of τ² in NMA. Again, if τ̂² comes out negative we set its value equal to zero.

Restricted maximum likelihood.
We now outline the REML method for estimating the variance parameter. We start by discussing this method in a more general sense (for a linear regression problem) and then relate it to the NMA model. Mathematical details are provided in appendix G.
In section 4.2.3 we showed that we can obtain estimates for the regression coefficients β in a linear model of the form y = Xβ + ε, with ε ∼ N(0, C), by maximising the log likelihood in equation (47) with respect to these parameters. This led to the Aitken estimator in equation (51).
The likelihood of this model is also a function of the variance parameters φ characterising the covariance matrix C (in the case of NMA this would be τ²). The Aitken estimator assumes that these are known, which is generally not the case. One way of obtaining the variance parameters is to maximise the log likelihood simultaneously with respect to β and φ.
A problem with this approach is that the resulting estimates of the variance parameters are generally biased [102]. This is because the variance estimate fails to account for the loss in degrees of freedom that results from estimating β [104]. This is well known, and can be demonstrated easily for a one dimensional normal distribution, see appendix F.
The REML approach was proposed by Patterson and Thompson [102] as a method to overcome this problem. The principal idea is to carry out a linear transformation of the variables y so that the likelihood function for the transformed variables no longer depends on the parameters β, but only on their estimates β̂ (which in turn depend on φ). This 'restricted likelihood' is then maximised with respect to the variance parameters.
As described in more detail in appendix G, the restricted log likelihood is given by [104,105]

ln RL(φ | y, X) = −½ ln(det C) − ½ ln(det(X^T C^{−1} X)) − ½ (y − Xβ̂)^T C^{−1} (y − Xβ̂) + const.   (66)

This is similar to the original log likelihood function in equation (47), but with dependence on the maximum likelihood estimator β̂ (given in equation (51)) instead of the true parameter β, and with an additional term, −½ ln(det(X^T C^{−1} X)). A possible iterative procedure to obtain estimates of the regression and variance parameters is then as follows: start with an initial choice for φ̂. This defines C. Use this in equation (51) to obtain β̂. Use this value for β̂ in equation (66) and then maximise ln RL(φ|y, X) with respect to φ (keeping β̂ constant). This delivers an updated value for φ̂. Then repeat, and iterate until convergence.
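For the pairwise case the restricted log likelihood can be written down and maximised directly. The sketch below uses a crude grid search as a stand-in for the proper numerical schemes, with invented data; in the pairwise case X is a column of ones, so X^T C^{−1} X = Σ_i w_i.

```python
import math

# REML estimation of tau^2 in pairwise meta-analysis via a simple grid
# search (a sketch; the trial effects y and within-study variances s2
# are invented for illustration)
y = [0.2, -0.4, 0.6, 1.2]
s2 = [0.04, 0.09, 0.16, 0.25]

def restricted_loglik(tau2):
    w = [1.0 / (s + tau2) for s in s2]
    mu = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)  # RE estimate of d
    # -1/2 ln det C - 1/2 ln det(X^T C^-1 X) - 1/2 (y - X mu)^T C^-1 (y - X mu)
    return (-0.5 * sum(math.log(s + tau2) for s in s2)
            - 0.5 * math.log(sum(w))
            - 0.5 * sum(wi * (yi - mu) ** 2 for wi, yi in zip(w, y)))

grid = [0.001 * k for k in range(2001)]   # candidate tau^2 values in [0, 2]
tau2_reml = max(grid, key=restricted_loglik)
```

Each evaluation of the restricted likelihood recomputes the weighted-mean estimate, which is exactly the alternation between equations (51) and (66) described above, collapsed into a one-dimensional search.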
In this process, the maximisation of the expression in equation (66) with respect to φ requires numerical techniques such as the Newton-Raphson method [106], Fisher's scoring algorithm [107], or the expectation-maximisation algorithm [108].
In NMA the parameter estimates β̂ are the estimates of the relative treatment effects d̂_RE in equation (56). The covariance matrix C is given by V + Σ, where the within-study covariance matrix V is assumed known, and the between-study covariance matrix depends on the unknown parameter τ² we wish to estimate.
In pairwise meta-analysis β̂ is the estimate of the treatment effect d̂_RE in equation (58), X is an M × 1 matrix of ones, and the covariance matrix C is diagonal with elements σ_i² + τ² (with σ_i² assumed known). Further details can be found in appendix G.1.
Of the different heterogeneity estimators described here, REML is usually the recommended option (see for example [93,94]).

Software and example code
Popular software for implementing frequentist NMA includes the mvmeta package [109,110] in STATA and the netmeta package [111] in R. In section 5 we use netmeta to analyse the thrombolytic drug data discussed earlier in section 2.1.4 and illustrated in figure 1. The R code used to perform this analysis is provided in a GitHub repository, see [112].

Reporting the results of a network meta-analysis
In sections 3 and 4 we have described how to obtain estimates for the model parameters (d, τ²) in NMA using Bayesian and frequentist methods. We now explain how these estimates are reported and summarised for use in decision making.

Obtaining all relative treatment effects
At the end of the inference process one reports an estimate for each parameter in d along with a measure of precision. In frequentist inference the parameter estimated is the one discussed in section 4.3.1, and an estimate of the variance is obtained from equation (57). In Bayesian inference the parameter estimate is usually the mean or median of the samples from the posterior distribution, and we also record the variance of the samples for the parameter [17].
We recall that the parameter vector d contains the N − 1 basic parameters, i.e. the effects of treatments a = T_2, …, T_N relative to the global baseline treatment T_1. To obtain the N(N − 1)/2 estimates of the relative effects between every pair of treatments, d̂_ab, we substitute the estimated basic parameters into the transitivity relation in equation (8), and find

d̂_ab = d̂_{T_1 b} − d̂_{T_1 a}.   (67)

In the frequentist setting, the associated variance is then given by

Var(d̂_ab) = Var(d̂_{T_1 a}) + Var(d̂_{T_1 b}) − 2 Cov(d̂_{T_1 a}, d̂_{T_1 b}),   (68)

where the variances and covariances of the basic parameters are estimated via equation (57).
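The transitivity step and its variance are linear contrasts of the basic parameters, i.e. e^T d̂ and e^T Cov(d̂) e for a vector e with entries ±1. A sketch with invented NMA output for N = 4 treatments:

```python
import numpy as np

# hypothetical output of a frequentist NMA with N = 4 treatments:
# d_hat[k] estimates the effect of T(k+2) relative to T1,
# with estimated covariance matrix cov_d (all numbers invented)
d_hat = np.array([0.3, -0.2, 0.5])
cov_d = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.05, 0.02],
                  [0.00, 0.02, 0.06]])

def contrast(a, b):
    """Estimate d_ab and its variance for treatments a, b in {1,..,4},
    using transitivity: d_ab = d_{T1 b} - d_{T1 a}."""
    e = np.zeros(len(d_hat))
    if b != 1:
        e[b - 2] += 1.0
    if a != 1:
        e[a - 2] -= 1.0
    return e @ d_hat, e @ cov_d @ e   # point estimate and variance

d_23, var_23 = contrast(2, 3)  # effect of T3 relative to T2
```

The variance returned by the contrast automatically includes the −2 Cov term, since e picks up both diagonal and off-diagonal entries of the covariance matrix.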
In the Bayesian setting, one can estimate the parameters d̂_ab at each MCMC iteration using the transitivity equation. One then records the mean (or median) and variance of these samples.

Confidence/credible intervals in frequentist and Bayesian inference
In a frequentist setting uncertainty on a parameter is often expressed in terms of confidence intervals, for example a '95% confidence interval'. In Bayesian inference uncertainty is expressed in terms of 'credible intervals'. We note the subtle difference between these two concepts. The Bayesian interpretation is intuitive: given the observed data, there is a 95% probability that the true (unknown) parameter lies within this interval [45]. In a frequentist setting this would mean that if we were to repeat the experiment and inference many times (each time constructing a 95%-confidence interval) then 95% of these intervals would contain the true value of the parameter [45].
Let x represent a parameter estimated from the NMA process. The ζ% confidence interval (0 ⩽ ζ ⩽ 100) is constructed from the parameter estimate and its variance, assuming a Gaussian distribution for the parameter with mean x̂ and variance Var(x̂). More precisely, the interval is

x̂ ± q(ζ) √Var(x̂),   (69)

where q(ζ) is such that a total probability of ζ% of the Gaussian distribution is in the interval of length 2q(ζ)√Var(x̂) around the mean. (Scaling out the variance, this means ∫_{−q}^{q} dx e^{−x²/2}/√(2π) = ζ/100.) For example, using q = 1.96 in equation (69) gives a 95% confidence interval. In a Bayesian setting the 95% credible interval can be obtained in a similar way from equation (69), but using the mean and variance of samples drawn from the posterior. Alternatively, one can calculate the 2.5% and 97.5% quantiles of the posterior samples. Equation (69) assumes a Gaussian distribution for the parameter of interest. Therefore, this procedure cannot be applied to obtain confidence intervals for the heterogeneity parameter, τ (which must be non-negative). In a Bayesian setting, one can construct a credible interval for τ from the upper and lower quantiles of the samples drawn from its posterior distribution. Frequentist methods to construct a confidence interval for the heterogeneity parameter, on the other hand, are more involved.

Figure 6. Forest plot of the estimated LORs for the thrombolytic drug data [10-12]. The network graph and treatment labels are shown in figure 1. The global baseline treatment is T_1 and has a LOR of 0 by definition. The outcome of interest is the number of deaths that occur within 30 or 35 days of a heart attack; therefore a negative LOR indicates that the treatment is more effective than the baseline T_1. The horizontal lines indicate the 95%-confidence intervals about the estimated LORs. (The grey boxes highlight the central value and do not convey any additional information.) Figure was created using the software netmeta [111]; relevant R code can be found here [112].
See [94] (and references therein) for an overview of these methods and refer to [113] for recommendations of the best approaches.
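For parameters with a Gaussian sampling distribution, the construction in equation (69) with q = 1.96 is a two-liner (the point estimate and variance below are invented numbers, e.g. a LOR and its variance):

```python
import math

# 95% confidence interval from a point estimate and its variance (a sketch)
x_hat, var_x = -0.5, 0.07
q = 1.96  # q(95) for a standard normal: P(-1.96 < Z < 1.96) ~ 0.95

half_width = q * math.sqrt(var_x)
ci = (x_hat - half_width, x_hat + half_width)
```

As noted above, this recipe applies to the relative treatment effects but not to the heterogeneity parameter τ, whose sampling distribution is not Gaussian.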

Forest plots
A common way to present the relative treatment effect estimates in both frequentist and Bayesian NMA is on a forest plot [109,111,114]. The NMA estimate of each basic parameter d_{T_1 a} is represented by a horizontal line centred on its estimated value d̂_{T_1 a}, with length equal to its confidence (or credible) interval. For relative treatment effects measured as LORs, a value of zero represents no difference in treatment effect. When the outcome from the trials is the number of negative events or 'failures', a LOR d_{T_1 a} < 0 indicates that treatment a is more effective (i.e. produces fewer failures) than the baseline T_1, and vice versa. A forest plot showing the results for a frequentist analysis of the thrombolytic drug data in section 2.1.4 is shown in figure 6. A 'caterpillar plot' is also sometimes used [13]. This is essentially the same as a forest plot but with the relative treatment effects sorted in order of increasing effect size [115].⁹
The forest plot shown in figure 6 shows only the basic parameter estimates, i.e. the treatment effects relative to the global baseline treatment T_1. In principle one could represent all N(N − 1)/2 relative effects on the plot, but this quickly becomes cumbersome as the number of treatments increases. The full set of treatment effects may instead be represented in a 'league table' [116]. This is an N × N matrix whose elements represent all the relative effect estimates d̂_ab.

⁹ One also comes across forest plots in the context of a pairwise meta-analysis. In this scenario each horizontal line represents the outcome as measured in each trial. The meta-analysis estimate is then shown at the bottom of the plot as a diamond centred on the estimated value with width equal to its CI. The NMA forest plot differs from this in that the results from individual trials are not shown. Instead each horizontal line represents a different relative treatment effect estimated from the NMA.

Ranking
The aim of an NMA is to provide clinicians with a clear statistical summary of all relevant data so that they know what the most desirable treatment options are for a particular condition. Relative treatment effect estimates and their confidence/credible intervals can be difficult to interpret and draw conclusions from, especially when many treatments have been compared [117,118].
Ranking treatments from best to worst based on their relative effect estimates is the simplest way of summarising the results of an NMA. For example, from the estimated treatment effects d̂_{T_1 a} in the forest plot in figure 6, the treatments would be ranked from best to worst in the order T_7, T_8, T_3 = T_6, T_5, T_4, T_1 = T_2, T_9 (where we have written a = b for treatments that are observed to be equally effective within the reported number of digits for the treatment effects).
However, this summary does not account for the level of overlap between the confidence intervals or the similarity between the point estimates. For example, in the thrombolytic drug data set treatment T_8 is ranked second best using this method despite the fact that it has a very large confidence interval covering almost the entire width of the other intervals. On inspection of the forest plot we cannot draw any meaningful conclusion about the effectiveness of treatment T_8, but this fact is not reflected in its rank.

Rank probability and rankograms.
A more sophisticated method of ordering the treatments is to calculate so-called rank probabilities [117]. That is, we calculate the probability that each treatment is best, second best, and so on. We use the notation P_a(r) to represent the probability that treatment a has rank r. These quantities are only meaningful in a Bayesian framework, where probability describes the degree of belief in parameter values. In a Bayesian NMA, treatments are ranked at each iteration of the MCMC according to the values of relative treatment effect sampled at that iteration. The probability P_a(r) is then estimated from the proportion of times treatment a was ranked rth.
Although rank probabilities do not strictly make sense in a frequentist framework, so-called 're-sampling' methods have been developed to produce estimates of rank probabilities based on the results of a frequentist NMA [49,110]. Essentially, this involves assuming that the distribution of the model parameters can be approximated by a normal distribution with mean and covariance matrix equal to the values estimated from frequentist methods (section 4.3). Values of relative treatment effect are sampled multiple times from this approximate distribution and rank probabilities are estimated in the same way as before.
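The re-sampling scheme can be sketched as follows. The basic parameters, covariance matrix and smaller-is-better convention below are invented for illustration; this is a conceptual sketch, not the netmeta implementation.

```python
import numpy as np

rng = np.random.default_rng(42)

# frequentist 're-sampling' of rank probabilities (a sketch):
# draw relative effects from N(d_hat, cov_d), rank the treatments at each
# draw, and estimate P_a(r) as the fraction of draws in which a ranks r-th.
# The baseline T1 has effect 0 by definition; we assume a smaller effect
# is better (e.g. LOR of deaths).
d_hat = np.array([0.3, -0.2, 0.5])       # hypothetical basic parameters
cov_d = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.05, 0.02],
                  [0.00, 0.02, 0.06]])

n_sim = 1000
draws = rng.multivariate_normal(d_hat, cov_d, size=n_sim)
effects = np.hstack([np.zeros((n_sim, 1)), draws])  # prepend T1 = 0

ranks = effects.argsort(axis=1).argsort(axis=1)     # 0 = best (smallest)
n_treat = effects.shape[1]
P = np.array([[np.mean(ranks[:, a] == r) for r in range(n_treat)]
              for a in range(n_treat)])             # P[a, r]
```

Each row of P is a rankogram for one treatment; by construction every row (and column) sums to one.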
The rank probabilities can be displayed graphically using 'rankograms' [117]. For each treatment, the rank probabilities P_a(r) are plotted against the rank r, either as a bar chart or a line graph. The rankograms for treatments in the thrombolytic drug data set are shown in figure 7 as bar charts. Rank probabilities were obtained from frequentist re-sampling methods (based on 1000 simulations) using netmeta [111]; accompanying R code is provided here [112].
Ranking using probabilities reflects not only the point estimates of relative treatment effects but also the uncertainties on these estimates and their overlapping confidence/credible intervals. Clearly, the more overlap between intervals, the flatter the rankograms will be. For example, in the thrombolytic data set, while treatment T_8 has the highest probability of being second best, its rankogram is relatively flat, indicating the uncertainty in its rank.

SUCRA and P-scores.
The number of rank probabilities for a network increases with N². Therefore rank probabilities and rankograms become increasingly difficult to interpret as the number of treatments increases [118].
Instead of rankograms, we could plot the cumulative probability F_a(r) against rank r to obtain cumulative ranking curves. Here, F_a(r) is the probability that treatment a has rank r or better,

F_a(r) = Σ_{s=1}^{r} P_a(s).   (71)

A simple summary of rank probabilities is then the area under these curves. Salanti et al [117] termed this measurement the 'surface under the cumulative ranking line' or SUCRA. The value of SUCRA for a particular treatment a is then

SUCRA_a = (1/(N − 1)) Σ_{r=1}^{N−1} F_a(r) = (N − E_a(r))/(N − 1),   (72)

where E_a(r) = Σ_{r=1}^{N} r P_a(r) is the mean or expected rank of treatment a. SUCRA takes values from 0 to 1, though these are often expressed as a percentage. If treatment a ranks first with probability one then it will have SUCRA_a = 1 (or 100%), whereas a treatment that ranks worst with probability one will have SUCRA_a = 0 (or 0%) [117]. SUCRA_a can be interpreted as the average proportion of treatments worse than a. SUCRA values provide a concise summary of treatment rankings that accounts for the estimated relative treatment effects, the uncertainty in these estimates, and the resulting overlap in their confidence/credible intervals. In frequentist NMA, values of SUCRA can be calculated from the rank probability estimates obtained via re-sampling methods. Alternatively, Rücker and Schwarzer [119] proposed an analogous quantity called a 'P-score' that does not require re-sampling. By assuming a normal distribution for the model parameters they define

F_ab = Φ( d̂_ab / √Var(d̂_ab) ),   (73)

where Φ(x) = ½[1 + erf(x/√2)] is the cumulative distribution function of the standard normal distribution. This is interpreted as the extent of certainty that d_ab > 0 (i.e. that a is more effective than b). The P-score for treatment a is then the mean of F_ab over treatments b ≠ a. This is the mean extent of certainty that a is more effective than any other treatment. Rücker and Schwarzer [119] showed that when the true probabilities are known, P-scores and SUCRA values are identical. In practice they give very similar results.
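Given a matrix of rank probabilities, the SUCRA values and the expected-rank identity in equation (72) can be checked directly (the rank probabilities below are invented for illustration):

```python
import numpy as np

# SUCRA from a matrix of rank probabilities P[a, r] (a sketch);
# r = 1 is the best rank and there are N = 3 treatments.
# The probabilities are invented; each row sums to one.
P = np.array([[0.6, 0.3, 0.1],
              [0.3, 0.5, 0.2],
              [0.1, 0.2, 0.7]])

N = P.shape[0]
F = np.cumsum(P, axis=1)                    # F[a, r]: prob. of rank r or better
sucra = F[:, :-1].sum(axis=1) / (N - 1)     # area under the cumulative curve

# equivalent formula via the expected rank E_a(r)
expected_rank = (P * np.arange(1, N + 1)).sum(axis=1)
sucra_alt = (N - expected_rank) / (N - 1)
```

The two routes agree exactly, illustrating that SUCRA is a rescaled expected rank: a treatment certain to be best gets 1, one certain to be worst gets 0.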

Assessing inconsistency
The critical assumption underlying NMA is the idea that we can combine direct and indirect evidence in the network. That is, that the relative treatment effects follow the transitivity relation (equation (8)). If the transitivity assumption is borne out by the data, then the network is thought to be 'consistent' [120]. 'Inconsistency' is then a measure of the extent of disagreement between direct and indirect evidence in the network.
Assessing the degree of inconsistency is an important part of undertaking an NMA. The development of methods to perform this evaluation is an active area of research within the meta-analysis community [12,24,27,36,96,99,120,121]. As this is an introductory review we do not explain these methods in detail but instead point to some important resources for further reading. In addition, refer to [24,27,120] for useful review articles on this topic.
The methods for assessing inconsistency can be split broadly into two categories: local and global methods [24]. Local methods focus on specific treatment comparisons for which there is both direct and indirect evidence, whereas global approaches attempt to quantify the level of inconsistency present in the network as a whole.
Local approaches for assessing inconsistency on a particular treatment comparison ab involve separating the evidence into that coming from direct comparisons (i.e. trials comparing a to b) and evidence from indirect comparisons (e.g. sets of trials comparing a and b to some common comparator treatment or treatments). One can then test the level of agreement between estimates that arise from the two types of evidence. Examples of these methods include the loop specific approach [27,122], the back calculation method [12] and the 'node splitting' or 'leave one out' approach [12]. An advantage of local methods is that they allow one to identify 'hot spots' of inconsistency in the network that can then be investigated and (hopefully) addressed.
Global approaches aim to quantify the degree of inconsistency present in the entire network of evidence. One class of methods that has been proposed involves incorporating additional parameters into the NMA model which aim to encapsulate any potential inconsistency. For example, in the Lu and Ades model [36] one rewrites the transitivity equation as

d_ac = d_ab + d_bc + w_abc, (74)

where the parameter w_abc quantifies the level of inconsistency between direct and indirect evidence for comparisons between treatments a, b and c. The different inconsistency parameters across the network are assumed to follow a common distribution, and global inconsistency is assessed by testing the null hypothesis that the parameters are equal to zero [24,36]. Other similar methods include the design-by-treatment inconsistency model [49,123,124] and the unrelated mean effects model [27]. An alternative approach to assessing global inconsistency is based on the heterogeneity Q-statistic (see section 4.3.2) in the context of the 'two-step' or 'aggregate' NMA model [125]. In the first step of this model, evidence from all trials making the same comparisons is combined. In the second step, these direct estimates are used as the observations in an NMA. The Q-statistic is the weighted sum of squared residuals between observed treatment effects and those estimated from the model (see e.g. equation (61)). The value of Q associated with the first step of the aggregate model measures the heterogeneity within trials that compare the same treatments. The Q-statistic associated with the second step quantifies inconsistency between trials that compare different combinations of treatments. The latter quantity can therefore be used as a test of global inconsistency; see [126] for further details.
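To make the Q-statistic concrete, a minimal sketch of the weighted sum of squared residuals for a set of trials that all compare the same pair of treatments (the effects and variances are invented):

```python
import numpy as np

# Hypothetical observed effects y_i and variances v_i from trials
# that all compare the same pair of treatments.
y = np.array([0.4, 0.6, 0.3, 0.9])
v = np.array([0.05, 0.10, 0.08, 0.20])

w = 1.0 / v                          # inverse-variance weights
d_hat = np.sum(w * y) / np.sum(w)    # fixed-effect weighted mean
Q = np.sum(w * (y - d_hat) ** 2)     # Q: weighted sum of squared residuals
# Under the null hypothesis of homogeneity, Q follows a chi-squared
# distribution with (number of trials - 1) degrees of freedom.
```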
Any inconsistency detected in the network should be investigated in order to identify the source of disagreement and to assess how this impacts the results of the NMA. Models can then be defined that relax the consistency assumption, or one can incorporate covariates into a regression-type NMA model [127,128]. Here, covariates represent effect modifiers that are likely to explain the disagreement between different types of evidence. If 'too much' unexplained inconsistency is identified then one may choose not to perform an NMA at all, as the evidence is considered too different to produce a meaningful result [120].

Existing points of contact between network meta-analysis and physics
In the previous sections we have introduced some of the essential concepts and methods for NMA. In the remainder of the paper we now discuss how physics (in particular, statistical physics) and physicists can contribute to this area.
For example, a number of analogies between meta-analysis and specific physical systems have been proposed in recent years. These analogies have provided insight, and they have helped to improve meta-analysis methodology and the visualisation of the problem. In this section we briefly outline these analogies.
Section 7 then describes a few examples of more general methods used in statistical physics which have been shown to be useful in a meta-analysis context. We also present a number of more speculative ideas on how knowledge from physics might be used for NMA.

Network meta-analysis and electrical networks
Arguably the most influential of the meta-analysis analogies was developed by Rücker [50] who demonstrated the connection between NMA and electrical network theory. The starting point for this analogy is the observation that variance in meta-analysis combines in the same way as resistance in an electrical network.

Variances in network meta-analysis combine like resistance in a network.
In section 4.3.1 we saw that for a frequentist pairwise meta-analysis, the variance of the estimated treatment effect d̂ is given by the variance of the weighted mean (equation (59)), calculated in terms of the variances associated with each of the trials. Taking the reciprocal on both sides of this equation and writing v_i for the variance associated with the measurement in trial i we find

1/Var(d̂) = Σ_{i=1}^{M} 1/v_i. (75)

For a random effects model we have d̂ = d̂_RE and v_i = σ_i² + τ², whereas for a fixed effect model d̂ = d̂_FE and v_i = σ_i². As illustrated in figure 8(a), a pairwise meta-analysis can be represented by a graph with two nodes (representing the two treatment options) and multiple parallel connections (edges) between the nodes (representing the individual trials comparing the treatments). The same graphical representation describes an electrical network with resistors connected in parallel (figure 8(b)). The effective resistance, R, of a set of M parallel resistors R_i, i = 1, ..., M, is

1/R = Σ_{i=1}^{M} 1/R_i. (76)

Therefore, resistors in parallel combine like variances in a pairwise meta-analysis.

Now consider the network in figure 8(c) comprising three treatments, T1, T2, T3, and two trials. The first trial i = 1 compares treatments T1 and T2 and measures a relative treatment effect y_{1,T1T2} with variance v_1. Trial i = 2 compares treatments T2 and T3 and measures y_{2,T2T3} with variance v_2. The network estimates of the relative effect between treatments T1 and T2, and between T2 and T3, respectively, are simply the direct estimates d̂_{T1T2} = y_{1,T1T2} and d̂_{T2T3} = y_{2,T2T3}. There is no direct evidence for the comparison of treatments T1 and T3. The network estimate of this comparison is obtained from an indirect estimate via T2,

d̂_{T1T3} = y_{1,T1T2} + y_{2,T2T3}. (77)

Since the trials are independent, the variance associated with this estimate is

Var(d̂_{T1T3}) = v_1 + v_2. (78)

As shown in figure 8(d), this set-up relates to an electrical network with resistors connected in series. The effective resistance for this network is

R = R_1 + R_2. (79)

Therefore resistors in series combine like variances for indirect estimates in NMA. The analogy with resistor circuits can be extended to networks of more than two trials, and to those combining both parallel connections and connections in series. Through a more detailed analysis (which we do not describe here) and by comparing Ohm's law to the weighted mean expression in equation (56), one can then establish a mapping between relative treatment effects and potential differences (voltages).
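The two combination rules are easy to verify numerically. A minimal sketch (the trial variances are invented):

```python
import numpy as np

# Hypothetical trial variances for a pairwise meta-analysis (parallel case).
v = np.array([0.05, 0.10, 0.08])

# Variance of the pooled estimate: reciprocals add, exactly as for
# resistors in parallel (1/R = sum_i 1/R_i).
v_pooled = 1.0 / np.sum(1.0 / v)

# Indirect comparison T1-T3 via T2 (series case): variances add,
# exactly as for resistors in series (R = R_1 + R_2).
v1, v2 = 0.05, 0.10
v_indirect = v1 + v2
```

The pooled variance is smaller than any individual trial variance (adding parallel resistors lowers the effective resistance), whereas the indirect variance is larger than either of its components.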

Analogy of the network meta-analysis problem in electric circuits.
In electrical network theory, graph theoretical methods are used in different ways, for example to construct the electric potentials at the nodes from external currents, or to compute the effective resistance between two nodes from the resistors in the network. Rücker showed that a similar set of methods can be used in NMA to derive an expression for the network estimates of the relative treatment effects from the observed effects. This leads to the same results as the Aitken estimator in equation (58) [50,129].
We do not present details here, but the core of the analogy can be described as follows [50]. The NMA problem consists of finding the network estimates for the relative treatment effects using the effects observed in the trials (for a known network structure and known (inverse-variance) weights of the trials). The observed effects will in general be inconsistent, whereas the estimates resulting from the NMA are consistent by construction. The observed effects translate to 'observed' voltages across the resistors in the network (the resistances are determined by the inverse-weights of the trials). As a consequence of the inconsistency, voltages along loops in the network will not add to zero. This means that no electric potentials can be assigned to the nodes from which the voltages would arise as potential differences. The key result is now that the problem of determining the NMA estimates for the treatment effects is equivalent to finding the set of electric potentials such that the resulting (consistent) voltages best approximate the observed (inconsistent) voltages 10. Quality of approximation is here measured in terms of the Euclidean norm.

Reduction of multi-armed trials.
One particularly useful application of the electrical network analogy is in the context of networks with multi-arm trials. Measurements of the different relative treatment effects from a multi-arm trial are correlated, and the presence of such correlations can cause complications for some NMA methodology. One therefore carries out a 'reduction' to an equivalent set of two-arm trials. We here briefly describe this, for details see [50,129].
We focus on a single multi-arm trial with m arms. The corresponding (sub)network involving the m treatments is then fully connected. The idea is to find a network consisting of m(m − 1)/2 pairwise trials which is 'equivalent' to the multi-arm trial, in the sense that the variances of the network estimates of relative treatment effects in the pairwise network (given by equation (57)) are the same as the variances in the graph describing the multi-arm trial.
As discussed above, variances in NMA combine like resistances in electric networks, i.e. the variances of network estimates are obtained from the individual trial variances in the same way as effective resistance is obtained from the physical resistors in electric circuits 11. The reduction problem for the multi-armed trial is therefore equivalent to finding the individual resistances {R_ab} in an electric network given the effective resistances between pairs of nodes. It is well known (see e.g. [130]) that

R^eff_ab = L⁺_aa + L⁺_bb − 2L⁺_ab, (80)

where L is the graph Laplacian describing the electric network, and where ⁺ denotes the pseudo-inverse. The graph Laplacian is defined by the individual (physical) resistors via L_ab = −R_ab⁻¹ for a ≠ b, and L_aa = Σ_{b≠a} R_ab⁻¹. The reduction problem therefore maps onto the problem of finding the elements of the graph Laplacian L in equation (80) for given effective resistances {R^eff_ab} (i.e. given variances in the multi-armed trial). The individual physical resistors (variances associated with the individual two-armed trials) can then be extracted from the off-diagonal elements of the Laplacian. Further details of the reduction method are given in appendix H.
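Equation (80) is straightforward to evaluate numerically. A minimal sketch (the triangle network and its resistances are invented), using NumPy's pseudo-inverse:

```python
import numpy as np

# Hypothetical resistances (in the NMA analogy: variances) on a triangle
# of treatments 0, 1, 2; R[(a, b)] is the resistance of edge ab.
R = {(0, 1): 2.0, (1, 2): 4.0, (0, 2): 4.0}

N = 3
L = np.zeros((N, N))
for (a, b), r in R.items():
    L[a, b] = L[b, a] = -1.0 / r   # off-diagonal entries: minus the conductance
for a in range(N):
    L[a, a] = -L[a].sum()          # diagonal: sum of conductances at node a

L_plus = np.linalg.pinv(L)         # Moore-Penrose pseudo-inverse

def effective_resistance(a, b):
    # Equation (80): R_eff_ab = L+_aa + L+_bb - 2 L+_ab.
    return L_plus[a, a] + L_plus[b, b] - 2 * L_plus[a, b]
```

For this triangle, the direct edge 0-1 (resistance 2) sits in parallel with the path 0-2-1 (resistance 4 + 4 = 8), giving an effective resistance of 2·8/10 = 1.6 between nodes 0 and 1.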
We perform this reduction for every multi-arm trial in a meta-analytic network, and use effect estimates from the multi-arm trials to assign estimates to the two-arm trials. This leads to a network of two-arm trials that is equivalent to the original network (i.e. it produces the same network estimates and variances). Methodology that does not allow for correlations can then be used on this new network.
Further details of the analogy between NMA and electrical networks can be found in [50,129].

Random walks
Random walks are a familiar concept to statistical physicists. Random hopping processes on networks are of particular interest in a number of areas [131][132][133]. In brief, a random walk on a network is a stochastic process describing a series of hops between nodes that are connected by an edge. We focus on discrete time such that each time step is associated with one hop across an edge.
There is a well-known analogy between random walks and electrical networks [134][135][136][137], briefly summarised in the next section. Using the work of Rücker [50] described in section 6.1 we were able to extend this analogy to NMA [16].

Random walks and resistor networks.
Each edge in an electrical resistor network has an associated resistance. Given such a network, one now constructs a random walk as follows: for a random walker currently at node a, the probability with which the walker hops from a to b in the next step is proportional to the inverse-resistance associated with the edge ab. More precisely, one defines the transition matrix elements as

T_ab = R_ab⁻¹ / Σ_c R_ac⁻¹, (81)

where R_ab is the resistance of the resistor connecting nodes a and b. Various physical quantities in the electric network then have interpretations in the random-walk picture. A good summary can be found in [137]. For example, consider the following scenario: a battery is attached to two nodes a and b in the resistor network. We assume that the network has only one single connected component. The voltage of the battery is chosen such that one ampere of current flows into node a from the battery (and consequently one ampere flows from node b back into the battery). This then induces currents I_cd through all edges cd (the resistors) in the network. Now imagine we release a random walker at node a, and it performs the random walk defined by equation (81). We stop the walk when the walker reaches node b (this will happen eventually given that the network consists of a single component). We can record the net number of times the walker has crossed edge cd before it reaches node b (hops from d to c contribute negatively to this value). One can then show [137] that the expected net number of crossings from c to d is given by the current I_cd in the electric network with the battery attached to nodes a and b.
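A minimal numerical sketch of this correspondence (the triangle network and resistances are invented): the transition matrix of equation (81) is built from the conductances, and the currents induced by a unit external current are obtained from the graph Laplacian.

```python
import numpy as np

# Hypothetical triangle resistor network; R[a, b] is the resistance of
# edge ab (np.inf on the diagonal and where there is no edge).
R = np.array([[np.inf, 2.0, 4.0],
              [2.0, np.inf, 4.0],
              [4.0, 4.0, np.inf]])

C = 1.0 / R                              # conductances (0 where no edge)
T = C / C.sum(axis=1, keepdims=True)     # transition matrix, equation (81)

# Node potentials when 1 A enters at node 0 and leaves at node 1:
L = np.diag(C.sum(axis=1)) - C           # graph Laplacian
ext = np.array([1.0, -1.0, 0.0])         # external currents at the nodes
phi = np.linalg.pinv(L) @ ext            # potentials (defined up to a constant)
I = (phi[:, None] - phi[None, :]) * C    # I[c, d]: current flowing from c to d
# Per the result summarised in [137], I[c, d] also equals the expected net
# number of c -> d crossings by a walker released at node 0 and stopped at node 1.
```

Here 0.8 A flows through the direct edge 0-1 and 0.2 A through the path 0-2-1, matching the parallel/series resistances of the triangle.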

Random walks and flow of evidence in network meta-analysis.
Starting from the existing analogies between electrical networks and NMA on the one hand, and electrical networks and random walks on the other, we (along with Rücker, Papakonstantinou and Nikolakopoulou) [16] proposed an analogy between NMA and random walks. In the following, we briefly summarise the main ideas.
We have seen that resistance in an electrical network is analogous to variance in an NMA. Therefore inverse-resistance is associated with inverse-variance weight. Writing w_ab for the weight associated with edge ab in an NMA (see section 4.3.1), we define the transition matrix of a random walker via

T_ab = w_ab / Σ_c w_ac. (82)

This is illustrated in figure 9.
In [16] we found that the expected net number of times a walker crosses each edge is equal to the so-called 'evidence flow' through that edge. This is a concept introduced by König et al [126]. We do not give full definitions here. Broadly speaking, the flow variable f^(ab)_cd is a coefficient that describes how much the observed relative effect between treatments c and d contributes to the network estimate of the relative effect between a and b. The coefficients are related to the entries of the matrix on the right-hand side of equation (56). They describe how different pieces of direct evidence in an NMA combine to give the overall network estimates of relative treatment effects. It turns out [126] that these coefficients have the properties of a flow. For example, for any node c ≠ a, b, the total flow into c equals the total flow out of c. The set of coefficients {f^(ab)_cd} can be used to define a directed weighted graph. This is for a fixed choice of a and b, meaning there is one directed graph for each comparison ab. Nodes in this graph again represent the treatment options, but the weights of the (now directed) edges are given by the flow of evidence {f^(ab)_cd}.

Figure 9. An illustration of the analogy between NMA and random walks. (a) An NMA with five treatments a = T1, T2, T3, T4, T5. Each edge is labelled with the inverse-variance weight associated with that treatment comparison, w_ab. (b) The transition probabilities for a random walker on the network in (a) who is currently at node T1. At the next time step this walker can move to node T2, T3 or T5 with probabilities proportional to the edge weights.

Random walks, streams of evidence and proportion contributions.
In [16] we then defined a second random walk. All walkers start at a and move on the directed graph just described. The construction is such that walkers can never return to a node they have already visited (the graph is acyclic), and all walks end at b (node b is absorbing). We can think of these walkers as collecting evidence along their way. They start at node a, hop to intermediate nodes and record differences in treatment effects (similar to differences in altitude when walking in a landscape). When a walker arrives at b it reports the total difference in altitude it has experienced. Due to inconsistencies, this reported difference may be different along different paths connecting a and b. The average of what the walkers report turns out to be the network estimate of the relative treatment effect between a and b [16].
In addition, the probability of a walker taking a certain path from a to b is given by the product of the transition probabilities in the edges along that path. This expression can be used to calculate so-called 'streams of evidence' [138], and what is referred to as the 'proportion contribution matrix' [139]. In particular, we used the random walk transition probabilities to derive an analytical expression for the contribution each treatment comparison makes to each overall treatment effect estimate. As discussed in more detail in [16], this random walk approach overcomes some limitations of previous algorithms used to construct this quantity. The random-walk method has recently been implemented in the software package netmeta [111].
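A toy sketch of how path probabilities arise as products of transition probabilities on a directed acyclic evidence graph (the graph and flow values are invented; this illustrates the general idea, not the actual algorithm implemented in netmeta):

```python
# Hypothetical directed acyclic 'evidence flow' graph for a comparison (a, b);
# the edge values play the role of the flows f_cd.
flow = {('a', 'b'): 0.6,
        ('a', 'c'): 0.4,
        ('c', 'b'): 0.4}

def paths(node, target, graph):
    """Enumerate directed paths from node to target, with probabilities given
    by products of flow-proportional transition probabilities along the way."""
    if node == target:
        return [((node,), 1.0)]
    out = {d: f for (c, d), f in graph.items() if c == node}
    total = sum(out.values())
    result = []
    for d, f in out.items():
        for tail, p in paths(d, target, graph):
            result.append(((node,) + tail, (f / total) * p))
    return result

path_probs = dict(paths('a', 'b', flow))
```

Every walker released at 'a' is eventually absorbed at 'b', so the path probabilities sum to one; each path can be read as one 'stream of evidence' for the comparison ab.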
The random walk analogy is a recent addition to NMA. As a result, only a small proportion of the random walk literature has been explored in this context. We therefore feel that there is scope to extend the analogy. We hope that the introduction to NMA we give in this paper will help other statistical physicists with an interest in random walks on networks to join these efforts.

System of springs
Papakonstantinou et al [140] visualised the meta-analysis problem as a system of springs. In a similar vein to the electrical network analogy, they observe that when connecting systems of springs the inverse of the spring constant combines in the same way as variance in meta-analysis. Unlike the electrical network framework however, the spring analogy refers only to pairwise meta-analysis and simple indirect estimates.
Hooke's law states that the force, f, needed to displace a spring by a length x is f = kx, where k is a measure of the stiffness of the spring known as the spring constant. The potential energy stored in such a spring is E = (1/2)kx². Consider a set of M springs fixed at one end and arranged in parallel as shown in figure 10(a). Each spring has a different spring constant, k_i, and a different natural length, l_i. The open ends of the springs are then connected, so that the springs are forced to assume the same length, called the 'effective length' l̄ of the spring system. Each spring has therefore been displaced by a different amount, x_i = l_i − l̄. The resulting equilibrium length, l̄, is such that the total force on T2 vanishes, i.e. Σ_i k_i x_i = 0. (Equivalently, this length minimises the energy stored in the system, E = (1/2) Σ_i k_i x_i².) Solving for l̄ we find

l̄ = Σ_i k_i l_i / Σ_i k_i.

This expression is in the form of a weighted mean. Comparing it to the estimate from a pairwise meta-analysis in equation (58), one observes that we can draw an analogy between the parallel system of springs and a pairwise meta-analysis. We interpret the spring constant as inverse-variance weight. The spring constant associated with the 'effective spring' is

k = Σ_i k_i.

Comparing this to equation (75), it is clear that variance in a pairwise meta-analysis combines in the same way as the inverse of the spring constant for a set of parallel springs.
One can also make this analogy for springs connected in series. As shown in figure 10(b), the effective length of a chain of springs connected together is simply the sum of their natural lengths. The effective spring constant is then given by

1/k = Σ_i 1/k_i.

Figure 10. (a) A system of springs connected in parallel. The open ends are forced to the same length, so that each spring is displaced from its natural length by some distance. This is equivalent to a pairwise meta-analysis. (b) A system of springs connected in series. This is equivalent to an indirect comparison in meta-analysis. We label each treatment as T_a. Each spring is labelled by its natural length, l_i, and the inverse of its spring constant, k_i⁻¹. l̄ is the effective length of the spring system. The different thicknesses of the springs represent their different spring constants.
Comparing this to equation (78), we observe that the inverse of the spring constants of springs connected in series combine like variances for an indirect estimate in meta-analysis. The natural length of each spring can be interpreted as the measurement of the relative treatment effect in each trial. The displacements x i then relate to the residuals associated with the relative treatment effects (i.e. differences between the relative effects measured in each trial and those predicted by the model). Minimising the energy is analogous to the process of minimising the sum of squared weighted residuals described in sections 4.2.1 and 4.2.2.
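The energy-minimisation picture can be checked numerically. A minimal sketch (natural lengths and spring constants invented) showing that the length minimising the total spring energy coincides with the inverse-variance weighted mean:

```python
import numpy as np

# Hypothetical natural lengths (trial effects) and spring constants
# (inverse-variance weights) for a set of parallel springs.
l = np.array([0.4, 0.6, 0.3])
k = np.array([20.0, 10.0, 12.5])

# Minimise E(x) = 0.5 * sum_i k_i (l_i - x)^2 by scanning a fine grid.
grid = np.linspace(0.0, 1.0, 100001)
energies = 0.5 * np.sum(k[:, None] * (l[:, None] - grid[None, :]) ** 2, axis=0)
l_eq = grid[np.argmin(energies)]

# Weighted mean, i.e. the pairwise meta-analysis estimate of equation (58).
l_bar = np.sum(k * l) / np.sum(k)
```

The grid minimiser agrees with the weighted mean to within the grid resolution; the residual energy at l_eq plays the role of the disagreement between the pieces of evidence.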
This analogy provides a useful visualisation of the meta-analysis process: combining the data from multiple medical trials is like minimising the energy in a system of springs. The energy stored in the final equilibrium system then represents a measure of disagreement between the different pieces of evidence. Though this analogy has been demonstrated for a set of relatively simple configurations, it has not yet been extended to a general NMA; this would require a much more complex spring system. Further details can be found in [140].

Figure 11. An illustration of pairwise meta-analysis as a centre of mass problem. Each mass is labelled by its position along the rod, x_i, and the force acting on it, m_i g, which is equal to its weight. The position of each object represents the measurement of relative treatment effect in each trial. The physical weight of each object represents the inverse-variance weight of each measurement. The centre of mass, x_CoM, is the position of the pivot that balances the torques. Finding this position is equivalent to finding the estimate of relative treatment effect in a pairwise meta-analysis.

Balance of torques in a mechanical system
Another visualisation of meta-analysis based on a mechanical system was proposed by Bowden and Jackson [141]. The framework was developed only for the pairwise case and has not yet been extended to include indirect estimates or NMA. In this analogy, the process of finding the minimum sum of weighted residuals (or, equivalently, maximising the likelihood, see sections 4.2.2 and 4.2.3) is equated to balancing torques in a system of weights. This is the same as finding the position of the centre of mass. Figure 11 shows a mechanical system consisting of a bar with M objects of different masses, m_i, hanging at various positions x_i along the bar. The bar is supported by a pivot. The system of masses (or weights) is balanced when the pivot is placed such that the torques exerted by the masses balance, i.e. Σ_{i=1}^{M} m_i g (x_i − x_pivot) = 0, where g is the acceleration due to gravity. Writing the weights of the different masses as w_i = m_i g, the balance of torques leads to

x_pivot = Σ_i w_i x_i / Σ_i w_i,

i.e. the pivot must be located at the centre of mass (in one dimension) defined by the system of weights. As before, we compare this to the pairwise meta-analysis estimate in equation (58) in order to establish the analogy. The position of each mass along the bar then represents the observed relative treatment effect in each trial, and the physical weight of each object represents the inverse-variance weight associated with each observation.
The problem of finding the position of the centre of mass provides a visual representation of finding the best estimate of the relative treatment effect in a pairwise meta-analysis. Bowden and Jackson used this representation to create an online interactive visualisation tool. The tool shows how different modelling assumptions (for example, fixed effects vs random effects) or the addition/removal of certain studies affect the position of the centre of mass. When the user makes a change to the model or the data, an animation shows the change in balance of the system. The tool was found to help identify the presence of small-study bias, a phenomenon where smaller studies report a systematically larger relative treatment effect than larger studies [142]. For details, see [141].

Using network theory to characterise meta-analytic graphs
A natural point of contact between NMA and statistical physics is the theory of networks. It is widely recognised that the accuracy and precision of NMA outcomes are likely to be affected by network topology. Indeed, researchers are encouraged to provide graphical and qualitative descriptions of network geometry [143]. Based on concepts from the theory of networks, researchers have defined topological indices to describe the geometry of meta-analytic networks and related these to the outcomes of NMA [81,144,145]. We give a few examples in this section, recognising that this existing work is only an initial step. We think that there is significant scope to extend these activities.
In network theory, the degree of a vertex is the number of edges attached to it. A graph is described as 'regular' if all vertices have the same degree. Based on these definitions, in [81] we defined a measure of 'degree irregularity' of the graph of treatment options,

h² = (1/N) Σ_a (k_a − k̄)² / k̄²,

which quantifies the variation in the number of studies involving each treatment. Here, k_a is the weighted degree of node a (the weight of an edge is given by the number of trials comparing the two treatments connected by that edge), and k̄ = (1/N) Σ_a k_a is the mean degree in the network. Through simulations of NMA, we found that smaller values of h² were associated with more precise treatment effects and smaller bias on rank probabilities.
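A toy computation of a degree-irregularity measure of this kind, taken here as the normalised variance of the weighted degrees (one plausible reading of the definition in [81]; the trial counts are invented):

```python
import numpy as np

# Hypothetical meta-analytic network: n_trials[a, b] = number of trials
# comparing treatments a and b (symmetric, zero diagonal).
n_trials = np.array([[0, 3, 1, 0],
                     [3, 0, 2, 1],
                     [1, 2, 0, 0],
                     [0, 1, 0, 0]])

k = n_trials.sum(axis=1)                      # weighted degree of each node
k_bar = k.mean()                              # mean degree
h2 = np.mean((k - k_bar) ** 2) / k_bar ** 2   # degree irregularity

# h2 = 0 for a regular graph; larger values indicate a more uneven
# distribution of trials across the treatment options.
```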
Tonin et al [144] adapted metrics from graph theory in order to numerically describe network geometry in NMA. They performed a systematic review of 167 published NMAs and used 11 metrics to describe the topology of each network. By performing a sensitivity analysis on each metric and assessing the level of correlation between the metrics, they identified four indicators that were the most useful for describing network geometry. These are (i) density: a measure of 'connectedness' equal to the number of edges in the network divided by the total number of edges possible for N vertices; (ii) percentage of common comparators: the percentage of vertices with more than one connecting edge relative to the total number of vertices N; (iii) percentage of strong edges: the percentage of edges with more than one study relative to the total number of edges; (iv) median thickness and dispersion measure: the median number of studies per edge and the IQR (interquartile range).
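A sketch of how indicators of this kind might be computed from a matrix of trial counts (the network is invented, and the variable names are ours, not those of [144]):

```python
import numpy as np

# Hypothetical trial counts per comparison for N = 4 treatments.
n_trials = np.array([[0, 3, 1, 0],
                     [3, 0, 2, 1],
                     [1, 2, 0, 0],
                     [0, 1, 0, 0]])

N = n_trials.shape[0]
upper = n_trials[np.triu_indices(N, k=1)]
studies_per_edge = upper[upper > 0]          # trials on each existing edge
n_edges = len(studies_per_edge)

density = n_edges / (N * (N - 1) / 2)        # (i) edges / possible edges
degree = (n_trials > 0).sum(axis=1)          # unweighted degree of each node
pct_common_comparators = 100 * np.mean(degree > 1)    # (ii)
pct_strong_edges = 100 * np.mean(studies_per_edge > 1)  # (iii)
median_thickness = np.median(studies_per_edge)          # (iv)
```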
Tonin et al recommended that in order to characterise the network of evidence from an NMA these measures, in addition to the number of vertices, the number of edges, and the number of trials per edge, should be reported alongside the network graph.
This existing work shows that concepts used in network theory to characterise the topology of networks can be linked to the outcomes of NMA. We think that further work exploring these ideas will be very worthwhile. Significant expertise exists in the complex networks community, and a large number of tools have been developed to characterise, for example, the clustering properties of networks, as well as measures of centrality and assortativity. Techniques are available to detect communities in networks, and more general motifs. Methods for systems with higher-order interactions and hypergraphs are also under active development [146]. NMA might be phrased in this language, since multiple treatment options 'interact' in each trial (see also figure 2(d)).
It will be exciting to see if and how these concepts can be used to study networks of medical trials. Finding good topological indicators of NMA performance also opens up the possibility of suggesting targeted additions to an existing meta-analytic network. In other words, these methods from the theory of (complex) networks could be used to identify the future trials whose addition to a meta-analysis promises to be most useful for improving the overall estimates of treatment effects. This could include concepts from chemical graph theory, e.g. the topological indices discussed in [147].

Network meta-analysis, constrained optimisation and statistical mechanics
Counting is very much at the heart of statistical physics. Boltzmann's entropy, S = k ln Ω, is defined based on the number of accessible microstates Ω of a system. Establishing the statistical physics of a system then boils down to accurately counting the number of microstates associated with a given macrostate. In the canonical and grand-canonical ensembles microstates are 'weighted' by the appropriate Boltzmann factors, and the counting is formalised in the canonical and grand-canonical partition functions. Anyone who has taken a course in statistical physics will remember the subtleties associated with determining the number of ways a given number of fermions or bosons can be distributed across a set of energy states.
Thermodynamics and statistical physics are also closely related to extremisation principles. Depending on the exact circumstances and external constraints, a thermodynamic system will extremise the appropriate thermodynamic potential. We also highlight the information-theoretic approach to statistical physics, beautifully introduced by Jaynes [148,149]. In this formalism the canonical and grand-canonical ensembles are obtained by maximising uncertainty (as measured by Shannon entropy) under the constraints defining these ensembles.
It is no surprise that connections can be established to other areas facing counting and extremisation problems. Examples are the closely related areas of 'constrained optimisation' and 'constraint satisfaction' in mathematics and computer science. Constrained optimisation problems involve finding the minimum (or maximum) of a function f(x), subject to constraints on the variable x = (x_1, ..., x_n). These constraints come as a set of relations connecting the x_i, for example equalities, inequalities, or a combination of the two. One example is graph partitioning, i.e. the problem of partitioning the set of nodes of a network into a given number of subsets while minimising the number of edges between these subsets. Another instance of constrained optimisation is the 'knapsack problem' (see e.g. [150]).
Constraint satisfaction problems are problems in which the variable x must satisfy a set of constraints. Often these are formulated on graphs. A good example is the graph colouring problem (more precisely, we describe the vertex colouring problem). Assume we have a graph consisting of N nodes, connected by a set of links. The problem consists of assigning a colour to each node so that no two neighbouring nodes have the same colour. The x_i, i = 1, ..., N, here represent the colours, and the constraints are local (each node of degree k ≥ 1 in the network leads to a constraint involving this node and its neighbours). Other constraint satisfaction problems are the celebrated travelling salesman and random k-satisfiability problems [151][152][153].
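To make the local-constraint structure concrete, a minimal sketch (the graph and colour assignment are invented) of checking whether an assignment satisfies the vertex-colouring constraints:

```python
# A tiny constraint-satisfaction check: does a colour assignment satisfy
# the vertex-colouring constraints of a graph?
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
colours = {0: 'red', 1: 'green', 2: 'blue', 3: 'red'}

def is_proper(edges, colours):
    # Each edge imposes one local constraint: its endpoint colours must differ.
    return all(colours[a] != colours[b] for a, b in edges)
```

Finding an assignment that satisfies all such constraints (or showing none exists) is the hard part; checking a candidate, as here, is easy.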
The methods used to address such problems include techniques such as message-passing (including belief propagation) and the cavity method. These tools are also relevant for spin glass problems in physics, and it is therefore natural for a statistical physicist to work at the interface of optimisation and computer science. Physicists have also been instrumental in characterising the typical properties of instances of constraint satisfaction or optimisation problems; this will be discussed further below (section 7.3).
It is hard to avoid seeing possible connections between NMA and these classes of problems. For example, it is natural to ask if ranking in NMA can be phrased as a constrained optimisation or satisfaction problem. It is perhaps useful to think of the meta-analytic network as a bipartite graph, with one type of node representing treatment options and the other trials (e.g. figure 2(d)). If there are N treatment nodes in the graph, then the ranking task consists of assigning the ranks 1, ..., N to those nodes, while satisfying constraints set by the trials or minimising a cost defined by the trials (each trial provides noisy information on the relative ranks of some treatments). The NMA problem hence falls into a broader class of problems: we start from a set of N objects (the treatment options), each of which has an unknown quality (treatment effect). We have noisy estimates of the relative qualities of subsets of these objects (from the trials), and we wish to estimate the intrinsic property of each object. If this is not possible, what can we say about the relative comparison between pairs of objects, or about a ranking of the objects in terms of quality?
While details would have to be thought through carefully, we think it is quite possible that parallels between NMA and existing optimisation or satisfaction problems can be established. There is then potential for statistical physicists to contribute via the methods mentioned above (message passing, cavity method, etc). We think this is an exciting perspective for future work.

Network meta-analysis and disordered systems
Related to the previous item, there may be interest in looking at the typical properties of the NMA problem and its solutions. This is a common approach in constrained optimisation and satisfaction problems. Instead of looking at single instances of these problems, one assumes the network is drawn from an ensemble of graphs, and then asks what the average or typical properties of problems in this ensemble are. For example, how many solutions to the optimisation or satisfaction problem are there, what subspace do they form in the overall state space (e.g. one connected manifold of solutions versus fragmented clusters), and what is the typical quality of a solution (i.e. how many of the constraints can be fulfilled)?
To answer these questions, an average over realisations of the graph and its constraints needs to be taken. One can then make use of tools from the physics of spin glasses and disordered systems, such as the replica method, dynamic mean field theory or cavity approaches. These allow one to characterise the energy landscape associated with these problems, the geometry of the solution space and, most importantly, the performance of algorithms for finding solutions to the optimisation or satisfaction problem.
It is conceivable that a similar approach could be taken in NMA. Based on known statistical features of meta-analytic graphs [145,154] one could define an ensemble of random NMA problems (i.e. the configuration of trials, treatment options and their connections is drawn from a distribution). One could then try to assess how features of the ensemble (e.g. connectivity, regularity, etc) affect the outcome of the NMA problems in the ensemble. This would allow one to step away from single instances, and instead to say more about typical cases.
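As a sketch of what such an ensemble might look like, one can draw random trial-treatment networks and ask, for instance, how often they are connected (a prerequisite for NMA). The uniform trial-size and treatment-selection rules below are illustrative assumptions, not fitted to the empirical statistics of published NMA graphs:

```python
import random

def random_nma_instance(n_treatments, n_trials, arm_range=(2, 4), seed=None):
    """Draw one member of a simple ensemble of meta-analytic networks:
    each trial compares a uniformly chosen subset of treatments."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        m = rng.randint(*arm_range)            # number of arms in this trial
        trials.append(sorted(rng.sample(range(n_treatments), m)))
    return trials

def is_connected(n_treatments, trials):
    """NMA requires a connected network: merge the trial arms with a
    union-find structure and check one component spans all treatments."""
    parent = list(range(n_treatments))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]      # path halving
            x = parent[x]
        return x
    for arms in trials:
        for t in arms[1:]:
            parent[find(t)] = find(arms[0])
    return len({find(t) for t in range(n_treatments)}) == 1

trials = random_nma_instance(6, 10, seed=1)
assert is_connected(3, [[0, 1, 2]]) and not is_connected(3, [[0, 1]])
```

Averaging properties such as connectivity, degree distributions or NMA estimator variance over many such draws is exactly the step away from single instances described above.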

Markov chain approaches to ranking
Markov chain approaches have been used as ranking methods in a variety of fields [155][156][157]. Most notably, Google's PageRank algorithm [155] ranks web pages in their search engine results. Broadly summarising, the algorithm is based on a Markov chain that models a 'random surfer' who, after being randomly assigned an initial webpage, moves from page to page by randomly clicking links. The algorithm also allows for a damping mechanism that works by assigning a certain probability with which the surfer is re-set to a random page at each step (this ensures that pages with no incoming links are also visited from time to time). The stationary distribution of the Markov chain then informs the ranking of the webpages. Broadly speaking, a page with more incoming links is more likely to be visited and therefore attracts a higher probability at stationarity, resulting in a higher rank.
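The random-surfer construction can be sketched with a short power iteration (a minimal illustration; the damping value 0.85 is the conventional choice, and the three-page network is hypothetical):

```python
import numpy as np

def pagerank(adj, damping=0.85, tol=1e-12):
    """Stationary distribution of the 'random surfer' chain: with
    probability `damping` follow a random outgoing link, otherwise
    jump to a uniformly chosen page (the resetting mechanism)."""
    n = adj.shape[0]
    out = adj.sum(axis=1, keepdims=True)
    # Pages with no outgoing links are treated as linking to everyone.
    P = np.where(out > 0, adj / np.maximum(out, 1), 1.0 / n)
    G = damping * P + (1 - damping) / n
    rank = np.full(n, 1.0 / n)
    while True:
        new = rank @ G
        if np.abs(new - rank).sum() < tol:
            return new
        rank = new

# Pages 0 and 1 link to each other; page 2 links to 0 but receives
# no links, so it is visited only via the reset and ranks lowest.
adj = np.array([[0, 1, 0],
                [1, 0, 0],
                [1, 0, 0]], dtype=float)
r = pagerank(adj)
assert np.argmin(r) == 2 and abs(r.sum() - 1) < 1e-9
```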
Markov chains can be used in a similar way to rank treatments in NMA. As in the random-walk framework, each state of the Markov chain is a treatment in the network. At each discrete time step the process moves from one treatment to another. Transitions represent a preference between two treatments: when the process is currently in state a, the probability of transitioning to b in the next time step is related to the probability that treatment b is more effective than treatment a [119]. The stationary distribution of the Markov chain can then be used to rank the treatments: a higher stationary probability indicates a more effective treatment.
The initial distribution of the Markov chain can be chosen so that it reflects clinically important factors other than the treatment effects, for example the cost or safety of the treatments [162]. Similar to the PageRank algorithm, Chaimani et al introduce a resetting mechanism: at every time step there is a non-zero probability of hopping to a state drawn from the initial distribution. This means that the stationary distribution of the Markov chain now depends on this initial distribution as well as the transition probabilities. This allows one to incorporate information about a range of factors that influence decision making.
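A chain of this general type can be sketched as follows. The specific transition rule used here (propose a uniformly chosen competitor and move there with probability equal to its chance of being more effective, otherwise stay put) is an illustrative assumption rather than the published specification, as is the hypothetical matrix of pairwise probabilities:

```python
import numpy as np

def ranking_chain(p_better, init, reset=0.15):
    """Transition matrix of a treatment-ranking chain with resetting:
    from treatment a, propose a uniformly chosen other treatment b and
    accept the move with probability p_better[b, a]; with probability
    `reset`, jump to a treatment drawn from `init` instead."""
    n = p_better.shape[0]
    T = np.zeros((n, n))
    for a in range(n):
        for b in range(n):
            if b != a:
                T[a, b] = p_better[b, a] / (n - 1)
        T[a, a] = 1.0 - T[a].sum()          # stay put if the move is rejected
    return (1 - reset) * T + reset * init[None, :]

def stationary(T):
    """Leading left eigenvector of T, normalised to a distribution."""
    vals, vecs = np.linalg.eig(T.T)
    v = np.real(vecs[:, np.argmax(np.real(vals))])
    return np.abs(v) / np.abs(v).sum()

# p_better[b, a]: probability that treatment b is more effective than a.
p = np.array([[0.0, 0.3, 0.2],
              [0.7, 0.0, 0.4],
              [0.8, 0.6, 0.0]])
pi = stationary(ranking_chain(p, init=np.full(3, 1 / 3)))
assert np.argmax(pi) == 2       # treatment 2 beats both others most often
```

Replacing the uniform initial distribution `init` by one weighted towards cheap or safe treatments is precisely the mechanism by which the resetting step injects factors other than efficacy into the ranking.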
The work by Chaimani et al demonstrates one way in which Markov chains can be used to rank treatments. Further research on this topic, adapting and extending the method to answer various questions in NMA, would be of interest. Extensions could include imposing additional constraints on the walker so that the treatment ranking reflects a range of desirable properties. For example, transition probabilities could be constructed from multiple (potentially conflicting) treatment characteristics, with weights assigned to reflect the importance of each characteristic in the decision-making process.
Given that the statistical physics community is experienced in methods involving random walks and Markov processes, we believe this is a topic in which physicists can contribute. One example is the theory of random walks with resetting, developed in the physics community over the last decade or so (see [166] and references therein), and which could turn out to be useful in the context of ranking algorithms.

Machine learning approaches to systematic reviews and Bayesian Markov chain Monte Carlo
The field of machine learning interfaces with statistical physics, and there is increasing interest among physicists in machine learning methods. There are multiple ways in which machine learning can contribute to the field of NMA, and this therefore defines another point of contact with statistical mechanics.
One way in which machine learning can be used in NMA relates to the Bayesian approach in section 3. Methods from machine learning have been proposed as alternatives to MCMC in a general context. This includes techniques such as expectation propagation [167], variational Bayesian inference [168] and integrated nested Laplace approximations [169,170].
A perhaps even stronger link to machine learning presents itself in the process of data acquisition for an NMA. In this paper, we have so far focused on the procedure of carrying out an NMA given a set of data from multiple trials. In reality, the process of compiling the data starts by performing a 'systematic review' of the existing trials for a particular medical problem. This is a systematic screening of the literature, using well defined procedures, followed by a process to decide which trials are adopted for the NMA. This decision making again uses well defined protocols and criteria. As part of this process a large number of journal articles must be searched, and appropriate data must be identified and extracted.
This obviously lends itself to automation, which speeds up the process, saves resources, and reduces the human error and inconsistencies that arise when a team of multiple researchers collects the data. Machine learning methods have indeed been employed to automate or semi-automate this process, and there is significant ongoing research and development in this field [171-173].
Given these contact points between NMA and machine learning, we think that researchers working at the interface of statistical physics and machine learning may find interest in applying the ideas and methods they are familiar with to the field of evidence synthesis. Their experience may then lead to the development of algorithms that improve on existing methods.

Simulation techniques
Simulation techniques play an important role in NMA. Most evidently, Monte Carlo sampling is necessary to carry out a Bayesian NMA (section 3). Further, simulation studies can provide insight into NMA [174]. These are studies in which trial data is generated synthetically (i.e. simulated). This allows one, for example, to test how NMA fares on different networks, and how different features of the graph affect NMA outcomes [81,175,176]. Such a simulation might use, for example, the random effects model described in section 2.4; we note again the hierarchical structure and the different levels of randomness. In simulation studies one typically has to look at many instances of networks, and average over a large number of realisations of synthetic data. Efficient simulation methods for the generation of data are therefore key. Statistical physicists are, due to their training, naturally familiar with simulation methods for random processes.
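A minimal sketch of such a data-generating process is given below. It reproduces the two levels of randomness (between-trial heterogeneity and within-trial sampling noise) but, for simplicity, ignores the correlation between heterogeneity terms in multi-arm trials; the treatment effects and noise scales are hypothetical:

```python
import numpy as np

def simulate_trials(d_true, trial_arms, tau, sigma, seed=None):
    """Generate synthetic relative treatment effects under a random
    effects model: each trial draws a trial-specific effect
    delta = d + eta (heterogeneity, std tau) and reports it with
    sampling noise (std sigma). The within-trial correlation of the
    heterogeneity terms in multi-arm trials is ignored here."""
    rng = np.random.default_rng(seed)
    data = []
    for arms in trial_arms:
        base = arms[0]                      # trial-specific baseline
        obs = []
        for t in arms[1:]:
            delta = (d_true[t] - d_true[base]) + tau * rng.standard_normal()
            obs.append((base, t, delta + sigma * rng.standard_normal()))
        data.append(obs)
    return data

d_true = {0: 0.0, 1: 0.5, 2: 1.0}           # hypothetical true effects
trials = simulate_trials(d_true, [(0, 1), (0, 2), (1, 2), (0, 1, 2)],
                         tau=0.1, sigma=0.2, seed=3)
```

Feeding many such synthetic data sets through an NMA pipeline, and varying the network topology, is the basic loop of the simulation studies cited above.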
Simulation studies are of course an established part of statistics, and consequently there is significant existing expertise. At the same time, the area will likely evolve further in the future. For example, one might think of modelling patient behaviour (e.g. compliance with the treatment [177]). This could lead to an agent-based component of the model, related to opinion dynamics and decision making, an area to which physicists have also contributed [2]. We therefore think that the development of future simulation methods defines another prospective avenue for statistical physicists to collaborate with medical statisticians, and to contribute to the field of NMA and evidence synthesis more generally.

Meta-analysis in particle physics
As a final, more speculative thought, we note that meta-analysis is used in particle physics to obtain the best estimates of particle properties such as masses, widths and lifetimes [178]. Expertise developed in this area might also be useful for meta-analysis and NMA in medical statistics.

Summary
Most of the existing NMA literature is naturally written by medical statisticians for medical statisticians, or for researchers actively using NMA tools and software packages in clinical practice. As a consequence, it is not easy to find an account of the essentials of NMA presented in a language physicists would be used to. The objective of this perspective review is to make a first step towards rectifying this. We hope the paper is a useful introduction to NMA for statistical physicists. The paper was, of course, written from our own limited perspective, and the selection of topics and the presentation are subject to personal bias.
Naturally, we could only include what we considered to be the most essential aspects of NMA for a physicist. Topics which could have been covered in a more extensive review include individual participant data [179], multi-component interventions [180], multiple outcomes [181], bias adjustment methods [18] and goodness-of-fit assessment [17]. In making our selection of the material for this paper, we aimed to focus on the concepts, ideas and methods, which would best enable the reader to access the wider literature. We tried to write a self-contained systematic introduction which could serve as a starting point for the reader to then explore the field more effectively.
We also hope that the paper highlights the importance of the area of evidence synthesis. Our review is successful if it excites others and if it convinces members of the community that NMA is a field in which statistical mechanics can make a difference.

Appendix A. Construction of the design matrix

This appendix describes the structure of the trial design matrices X_i. In a given row ℓ, all entries of X_i are zero, except those corresponding to the treatments that are being compared. More precisely, we distinguish two cases: (i) the trial-specific baseline treatment is not the global baseline (t_{i,1} ≠ T_1), and (ii) the trial-specific baseline is the global baseline (t_{i,1} = T_1).
We focus first on case (i). In row ℓ, the matrix entry for the treatment t_{i,ℓ} that is being compared against the trial-specific baseline is set to +1. The entry in the column corresponding to the trial-specific baseline (which is among the {T_2, ..., T_N} in case (i)) is set to −1. All other N − 3 entries in row ℓ are zero. In situation (ii), we again set the entry for the treatment t_{i,ℓ} that is being compared against the trial-specific baseline to +1. All other N − 2 entries in row ℓ are zero.
To illustrate this, consider the example network in figure A1. This network consists of N = 5 treatments and M = 2 trials. The global baseline treatment is T_1. The vector of basic parameters is d = (d_{T_1T_2}, d_{T_1T_3}, d_{T_1T_4}, d_{T_1T_5})^T. Trial i = 1 compares treatments A_1 = {T_2, T_3, T_4, T_5} with trial-specific baseline t_{1,1} = T_2. This is an example of option (i) and its design matrix is

X_1 = \begin{pmatrix} -1 & 1 & 0 & 0 \\ -1 & 0 & 1 & 0 \\ -1 & 0 & 0 & 1 \end{pmatrix}.

Trial i = 2 compares treatments A_2 = {T_1, T_3, T_5} and its trial-specific baseline is t_{2,1} = T_1. This is an example of option (ii) and its design matrix is

X_2 = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}.

Appendix B. Assessing convergence of the Markov chains

If the chains have converged to the stationary distribution, we expect the pooled variance and the average within-chain variance to converge to the same value [71]. To assess the number of iterations required for convergence we split each chain (total length 2n) into n/b batches of length 2b. For a given integer k = 1, ..., n/b we then calculate the pooled variance, the average within-chain variance and their ratio R̂ based on the samples in the first k batches, where the first half of this data is discarded as burn-in. That is, for k = 1 we use the first batch (length 2b), discard the first b iterations of it, and compute the variances and their ratio from iterations b + 1, ..., 2b. For k = 2 we use the first 4b iterations (two batches), again discard the first half, and compute the variances from iterations 2b + 1, ..., 4b. For higher values of k we proceed analogously.
Plotting the variances and their ratio against k (or 2kb) as shown in figure B1, we can assess the approximate number of iterations required for the variances to stabilise and for R̂ to be sufficiently close to unity, so that the chain can be assumed to have converged [72]. More recent refinements of this method are discussed in [70].
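The ratio R̂ can be computed along the following lines (a minimal sketch of the pooled-versus-within-chain variance comparison; rows are chains, columns are iterations):

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor R-hat for a set of chains:
    compare the pooled variance estimate with the average
    within-chain variance, as described in the text."""
    m, n = chains.shape
    within = chains.var(axis=1, ddof=1).mean()
    between = n * chains.mean(axis=1).var(ddof=1)
    pooled = (n - 1) / n * within + between / n
    return np.sqrt(pooled / within)

rng = np.random.default_rng(0)
mixed = rng.standard_normal((4, 2000))          # chains sampling the same target
stuck = mixed + np.arange(4)[:, None]           # chains stuck at different values
assert abs(gelman_rubin(mixed) - 1.0) < 0.05
assert gelman_rubin(stuck) > 1.5
```

Computing this quantity on growing prefixes of the chains (after discarding the first half as burn-in) reproduces the batch-wise diagnostic described above.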

Appendix C. Estimating within-study variances and correlations of observed treatment effects
In this appendix we discuss how the matrix V i in equation (41) (see section 4.1) is estimated from trial data for the example of binary outcomes. This matrix describes the variance and correlations within a trial due to fluctuations arising from the finite number of subjects in the different arms of the trial (i.e. sampling errors).

C.1. General setup and estimating the variance of observed treatment effects
For binomial data, the observed relative treatment effects in trial i are given by the log-odds ratios (LORs),

y_{i,1ℓ} = \ln[ \hat p_{i,ℓ} / (1 − \hat p_{i,ℓ}) ] − \ln[ \hat p_{i,1} / (1 − \hat p_{i,1}) ], (C.1)

where \hat p_{i,ℓ} = r_{i,ℓ}/n_{i,ℓ} is the proportion of events in arm ℓ of trial i. The variances, σ²_{i,1ℓ}, and covariances, Cov(y_{i,1ℓ}, y_{i,1ℓ'}), associated with these observations define the covariance matrix V_i in equation (41). These values can be estimated from the data. To do so we first work out the random sampling variance associated with the values \hat p_{i,ℓ}.
The number of events measured in arm ℓ of trial i is a binomial random variable r_{i,ℓ} ∼ Bin(n_{i,ℓ}, p_{i,ℓ}). It has mean E(r_{i,ℓ}) = n_{i,ℓ} p_{i,ℓ} and variance Var(r_{i,ℓ}) = n_{i,ℓ} p_{i,ℓ}(1 − p_{i,ℓ}). Using Var(bx) = b² Var(x) (for random variable x and constant b), we find

Var(\hat p_{i,ℓ}) = p_{i,ℓ}(1 − p_{i,ℓ}) / n_{i,ℓ}. (C.2)

Assuming n_{i,ℓ} is large, and propagating the errors for the logit function to linear order, then leads to

Var[logit(\hat p_{i,ℓ})] ≈ 1 / [n_{i,ℓ} p_{i,ℓ}(1 − p_{i,ℓ})]. (C.3)

We get an estimate of this variance by setting p_{i,ℓ} = \hat p_{i,ℓ},

\widehat{Var}[logit(\hat p_{i,ℓ})] = 1/r_{i,ℓ} + 1/(n_{i,ℓ} − r_{i,ℓ}). (C.4)

Since the values \hat p_{i,ℓ} for different ℓ are independent, the variance of y_{i,1ℓ} is σ²_{i,1ℓ} = Var(y_{i,1ℓ}) = Var[logit(\hat p_{i,ℓ})] + Var[logit(\hat p_{i,1})], which we can estimate using equation (C.4),

\hat σ²_{i,1ℓ} = 1/r_{i,ℓ} + 1/(n_{i,ℓ} − r_{i,ℓ}) + 1/r_{i,1} + 1/(n_{i,1} − r_{i,1}) (C.5)

[5, 28, 82]. The conventional assumption is that the estimates \hat σ²_{i,1ℓ} can be used in the matrix V_i in equation (41) in place of the true variances [7, 82].
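These estimates are straightforward to compute from the arm-level counts; a small sketch with hypothetical data:

```python
import math

def log_odds_ratio(r_l, n_l, r_1, n_1):
    """Observed log-odds ratio of arm l versus baseline arm 1, and
    its estimated sampling variance (sum of 1/r + 1/(n - r) over
    both arms), as in the estimates derived in this appendix."""
    y = math.log(r_l / (n_l - r_l)) - math.log(r_1 / (n_1 - r_1))
    var = 1 / r_l + 1 / (n_l - r_l) + 1 / r_1 + 1 / (n_1 - r_1)
    return y, var

# Hypothetical trial: 40/100 events on the active arm, 25/100 on baseline.
y, var = log_odds_ratio(40, 100, 25, 100)
```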

C.2. Estimating correlations between different observed treatment effects in the trial
The observations y_{i,1ℓ} of the relative treatment effects within a trial are correlated because they all involve the same common treatment arm t_{i,1}. To calculate the covariance between the observations we note that, for ℓ ≠ ℓ', the only random contribution shared by y_{i,1ℓ} and y_{i,1ℓ'} is the baseline term logit(\hat p_{i,1}), while the \hat p_{i,ℓ} for different arms are independent. This leads to Cov(y_{i,1ℓ}, y_{i,1ℓ'}) = Var[logit(\hat p_{i,1})], which we estimate as

\widehat{Cov}(y_{i,1ℓ}, y_{i,1ℓ'}) = 1/r_{i,1} + 1/(n_{i,1} − r_{i,1}). (C.10)

Again, the convention is to assume that the estimate of the covariance in equation (C.10) can be used in place of the true covariance [17].
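A sketch of the corresponding covariance estimate (only the shared baseline arm contributes):

```python
def lor_covariance(r_1, n_1):
    """Estimated covariance between two observed LORs that share
    baseline arm 1: only the common arm contributes, giving
    Var[logit(p_1)] estimated as 1/r_1 + 1/(n_1 - r_1)."""
    return 1 / r_1 + 1 / (n_1 - r_1)

# Hypothetical baseline arm with 25 events among 100 patients.
cov = lor_covariance(25, 100)
```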

C.3. Limitations of estimating within study variance
It is evident from equations (C.1) and (C.5) that the values y_{i,1ℓ} and \hat σ²_{i,1ℓ} are correlated (both expressions depend on the variables (r_{i,ℓ}, n_{i,ℓ}) and (r_{i,1}, n_{i,1})) [82]. This causes a systematic relationship between the magnitude and weight of the observations, which leads to a bias in the overall effect estimates [15, 82].
Estimation of y_{i,1ℓ} and σ²_{i,1ℓ} from the observed data also causes problems when events are observed for either no patients in a trial arm, or for all patients in an arm. By inspection of equations (C.1) and (C.5), we notice that if r_{i,ℓ} = 0 or r_{i,ℓ} = n_{i,ℓ} then y_{i,1ℓ} and \hat σ²_{i,1ℓ} will be undefined. A common ad hoc method to avoid this problem is to add a value of 0.5 to every r_{i,ℓ} and n_{i,ℓ}. This has been found to produce biased estimates of effect size [15, 182, 183]. In fact, this limitation, along with the assumption that within-study variances are known, is the main criticism of the frequentist inverse-variance approach to NMA [184, 185]. Alternative methods such as the Mantel–Haenszel method and generalised linear mixed models (GLMMs) have been recommended [14, 15].

Appendix D. Generalised least squares problem
To find the GLS estimator we carry out a transformation of the GLS model so that it fulfills the OLS conditions. To this end we note that, since C is a covariance matrix, it is symmetric and positive-definite. Therefore we can write C = K^T K = KK, where K is the (symmetric) square root of C [84]. We now multiply both sides of equation (42) by K^{−1}; the noise in the transformed model then has unit covariance matrix, so that the ordinary least squares solution of the transformed problem yields the GLS estimator.

Appendix E. Method of moments estimation of heterogeneity

E.1. Pairwise meta-analysis

In this section we derive the method of moments estimator for heterogeneity in pairwise meta-analysis. We begin by evaluating the expectation of the statistic

Q = \sum_i w_i (y_i − \hat y)²,

where \hat y is defined in equation (60), w_i = 1/σ_i², and the observations y_i are assumed to follow the random effects (RE) model (equation (E.2)). Using E_RE(y_i − \hat y) = 0, we find

E_RE(Q) = \sum_i w_i [Var(y_i) − 2 Cov(y_i, \hat y) + Var(\hat y)]. (E.4)

From the RE model in equation (E.2) we know Var(y_i) = σ_i² + τ². Expressing the second and third terms of equation (E.4) in terms of Var(y_i), using the definition of \hat y in equation (60), one finds

E_RE(Q) = (M − 1) + τ² ( \sum_i w_i − \sum_i w_i² / \sum_i w_i ).

Setting this equal to the observed value of Q and re-arranging for τ² we find

\hat τ² = [ Q − (M − 1) ] / ( \sum_i w_i − \sum_i w_i² / \sum_i w_i ),

where we have used the definition of \hat y in equation (60). This is the result quoted in equation (62) in the main paper.
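The resulting estimator can be sketched numerically as follows (hypothetical observations and variances; the truncation at zero is part of the method):

```python
def dersimonian_laird(y, var):
    """Method of moments heterogeneity estimate for pairwise
    meta-analysis: match the observed Q statistic to its expectation
    under the random effects model, truncating at zero."""
    w = [1 / v for v in var]
    sw, sw2 = sum(w), sum(wi * wi for wi in w)
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    q = sum(wi * (yi - y_bar) ** 2 for wi, yi in zip(w, y))
    m = len(y)
    tau2 = (q - (m - 1)) / (sw - sw2 / sw)
    return max(tau2, 0.0)

# Perfectly homogeneous trials give tau^2 = 0 after truncation;
# spread-out effects give a positive heterogeneity estimate.
assert dersimonian_laird([0.5, 0.5, 0.5], [0.1, 0.1, 0.1]) == 0.0
assert dersimonian_laird([0.0, 1.0, 2.0], [0.1, 0.1, 0.1]) > 0.0
```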

E.2. Network meta-analysis
In this section we derive the DerSimonian and Laird [5] method of moments heterogeneity estimator for NMA. As before, we begin by evaluating the expectation of Q under the random effects model. We now have

Q = (y − \hat y)^T V^{−1} (y − \hat y), (E.11)

where \hat y and the underlying estimator \hat d are given by equations (63) and (64) of the main paper. To simplify equation (E.11) we follow Jackson et al [99] and define the matrix

A = V^{−1} − V^{−1} X (X^T V^{−1} X)^{−1} X^T V^{−1}, (E.13)

such that

y − \hat y = VAy. (E.14)

Therefore,

Q = (VAy)^T V^{−1} (VAy) = y^T AVAy, (E.15)

since both V and A are symmetric. The former is symmetric by definition; the latter can be shown to be symmetric by taking the transpose of the right-hand side of equation (E.13). By explicit evaluation we find

AVA = A, (E.16)

such that

Q = y^T A y. (E.17)

As in the pairwise case we take the expectation of Q under the random effects model. Defining ξ = η + ε, we re-write the RE model (equation (39)) as y = Xd + ξ. By explicit evaluation we find AX = 0, and hence

E_RE(Q) = E_RE(ξ^T A ξ). (E.21)

For any vector z with mean μ_z and covariance matrix V_z, one has E(z^T B z) = tr(B V_z) + μ_z^T B μ_z, where B is a square matrix and tr(·) indicates the trace of a matrix. This identity can be checked directly, see also [186] (theorem 4, p 75, chapter 2). The vector ξ in equation (E.21) has mean 0 and covariance matrix V + Σ, therefore

E_RE(Q) = tr[A(V + Σ)] = tr(AV) + τ² tr(AP).

In the last step we have written Σ = τ² P, where P is a block-diagonal matrix. Each (m_i − 1) × (m_i − 1) block (representing trial i) has diagonal elements equal to 1 and off-diagonal elements equal to 1/2. All other elements are zero. Using the fact that the trace is invariant under cyclic permutations we find

tr(AV) = \sum_{i=1}^M (m_i − 1) − (N − 1), (E.23)

where \sum_{i=1}^M (m_i − 1) is the lateral dimension of V (the number of observations in y) and N − 1 is the lateral dimension of X^T V^{−1} X (the number of mean relative treatment effects we wish to estimate). The expression in equation (E.23) is the number of degrees of freedom associated with the regression (that is, the difference between the number of data points and the number of parameters).
Setting E_RE(Q) equal to the observed value of Q in equation (E.11) and re-arranging for τ² we find

\hat τ² = { Q − [ \sum_{i=1}^M (m_i − 1) − (N − 1) ] } / tr(AP).
This is the result given in equation (65) of the main paper.

Appendix F. Bias in maximum likelihood variance estimation
Consider a random variable y = (y_1, ..., y_N) whose components are independent and normally distributed, y_i ∼ N(μ, σ²). The likelihood function is then

L(μ, σ²|y) = \prod_{i=1}^N (2πσ²)^{−1/2} \exp[ −(y_i − μ)² / (2σ²) ].

Maximising the likelihood function with respect to μ and σ² leads to the expressions

\hat μ = (1/N) \sum_{i=1}^N y_i ≡ \hat y,    \hat σ² = (1/N) \sum_{i=1}^N (y_i − μ)².
The result of the joint maximisation with respect to μ and σ² therefore leads to the maximum likelihood estimator

\hat σ² = (1/N) \sum_{i=1}^N (y_i − \hat y)², (F.5)

that is, we have to substitute the maximum likelihood estimate \hat y into the expression for \hat σ².
To show that the expected value of the variance estimator \hat σ² is not equal to the true variance σ², we re-write equation (F.5) as

\hat σ² = (1/N) \sum_{i=1}^N [ (y_i − μ) − (\hat y − μ) ]²,

which, after some re-arranging and using the fact that \hat y − μ = (1/N) \sum_{i=1}^N (y_i − μ), becomes

\hat σ² = (1/N) \sum_{i=1}^N (y_i − μ)² − (\hat y − μ)².

The expectation of \hat σ² is then

E(\hat σ²) = σ² − Var(\hat y).

The first term is the true variance σ². The second term is the variance of \hat y, Var(\hat y) = σ²/N. Therefore,

E(\hat σ²) = (1 − 1/N) σ²,

indicating that the ML variance estimator is biased downwards by σ²/N. For large samples, N → ∞, this bias becomes negligible.
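This bias is easy to verify numerically (a sketch; the sample size and number of repetitions are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)
N, sigma2 = 5, 1.0
# Many independent samples of size N from N(0, sigma^2).
samples = rng.standard_normal((200_000, N)) * np.sqrt(sigma2)

# The ML estimator divides by N; its expectation is (1 - 1/N) sigma^2.
ml_var = ((samples - samples.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)
assert abs(ml_var.mean() - (1 - 1 / N) * sigma2) < 0.01
```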

Appendix G. Restricted maximum likelihood estimation of heterogeneity
In section 4.3.2 we explained that the REML method for the estimation of heterogeneity is based on a transformation of the variables y so that the likelihood depends on the parameter estimates \hat β rather than the parameters themselves. The associated 'restricted' likelihood of this transformed variable is given by

RL(φ|y, X) ∝ (\det C)^{−1/2} (\det X^T C^{−1} X)^{−1/2} \exp[ −(1/2)(y − X\hat β)^T C^{−1} (y − X\hat β) ]. (G.1)

We can arrive at this expression in several ways. One possible method involves evaluating the marginal likelihood of the transformed variable (see [187] for details). Alternatively, one can use a Bayesian interpretation [105]: one assumes that nothing is known about β, assigns an improper flat prior (a prior that is not properly normalised), and integrates β out of the joint likelihood. Maximising the restricted likelihood with respect to τ² leads to the REML estimating equation (G.6), in which a truncation at zero ensures that \hat τ²_REML remains non-negative. The joint system of equations (G.6) and (58) can then be solved iteratively for \hat τ²_REML and \hat d_RE.
Appendix H. Adjusting multi-arm trial weights using a result from electrical network theory

Here we explain the method for adjusting the variances associated with measurements in an NMA in order to account for correlations introduced by multi-arm trials [50, 129]. This is based on the method described in Gutman and Xiao [130] for reconstructing the individual resistances in an electrical network from the effective resistances between pairs of nodes.

We focus on a multi-arm trial i which compares m_i treatments. This trial yields a total of q_i = m_i(m_i − 1)/2 treatment effect estimates and associated variances v_{i,ab}. For a random effects model these variances are v_{i,ab} = σ²_{i,ab} + \hat τ², where \hat τ² is an estimate of the between-trial heterogeneity. In a fixed effect model v_{i,ab} = σ²_{i,ab}. We write these variances in an m_i × m_i matrix, \tilde V_i. We label this matrix \tilde V_i to distinguish it from the (m_i − 1) × (m_i − 1) matrix V_i defined in the main paper. Each row and column of \tilde V_i represents a treatment in trial i. The diagonal elements are equal to zero and each off-diagonal element is the variance associated with the comparison of the corresponding pair of treatments. For example, in a multi-arm trial i that compares treatments {T_1, T_2, T_3}, the variance matrix is

\tilde V_i = \begin{pmatrix} 0 & v_{i,12} & v_{i,13} \\ v_{i,12} & 0 & v_{i,23} \\ v_{i,13} & v_{i,23} & 0 \end{pmatrix}.

The aim now is to reconstruct the (inverse-variance) weights for a set of three two-arm trials that yield network estimates of relative treatment effects whose variances are equal to the variances in \tilde V_i. To this end, we use a method from electrical theory for the back-calculation of edge resistances from a set of effective resistances [50, 130].
We saw in section 6.1 that the effective resistances in an electrical network are related to the pseudo-inverse of the Laplacian. In NMA, effective resistances are associated with the variances \tilde V_i observed in the multi-arm trial. A result from electrical network theory is that we can construct the pseudo-inverse of the Laplacian directly from the effective resistances [130]. Using this result for an m_i-armed trial with 'effective' variances \tilde V_i gives the m_i × m_i matrix

L_i^+ = −(1/2)(I − O_i/m_i) \tilde V_i (I − O_i/m_i),

where O_i is an m_i × m_i matrix of ones [50]. An equivalent, more compact expression in terms of the edge-incidence matrix B_i of trial i, which describes what edges (treatment comparisons) are present in the trial, is used in [129]. B_i has dimensions q_i × m_i, where we recall q_i = m_i(m_i − 1)/2. Each row of B_i represents a pairwise comparison in the trial, and each column represents a treatment. There is a +1 in the column corresponding to the baseline treatment for that comparison and a −1 in the column representing the treatment compared to that baseline. All other entries are zero.
Once we have the pseudo-inverse of the Laplacian we can work out the Laplacian itself using L_i = (L_i^+)^+ [50, 189]. In an electrical network the graph Laplacian is defined by the individual (physical) resistors, L_{ab} = −R_{ab}^{−1} for a ≠ b, and L_{aa} = \sum_{b≠a} R_{ab}^{−1}. Therefore the values R_{ab} are obtained by inspection of the off-diagonal elements of L. Similarly, in the NMA context, the adjusted (inverse-variance) edge weights for multi-arm trial i can be obtained from the off-diagonal elements of the Laplacian L_i. For the three-arm trial,

\tilde w_{i,ab} = −(L_i)_{ab}, a ≠ b,

where \tilde w_{i,ab} are the adjusted edge weights that describe the inverse-variances for a network of two-arm trials which is equivalent to the multi-arm trial.
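The full reconstruction can be sketched as follows. As a consistency check, we generate the 'effective' variances of a hypothetical three-arm trial from known edge resistances and verify that the procedure recovers the corresponding inverse-variance weights:

```python
import numpy as np

def adjusted_weights(V_tilde):
    """Adjusted inverse-variance edge weights for a multi-arm trial,
    via the electrical-network reconstruction: treat the observed
    variances as effective resistances, build the pseudo-inverse of
    the Laplacian, invert back, and read off the edge conductances
    from the off-diagonal elements."""
    m = V_tilde.shape[0]
    J = np.eye(m) - np.ones((m, m)) / m        # centering matrix
    L_plus = -0.5 * J @ V_tilde @ J            # pseudo-inverse of the Laplacian
    L = np.linalg.pinv(L_plus)                 # Laplacian of the trial graph
    return -L                                  # off-diagonals = adjusted weights

# Hypothetical triangle of edge resistances (inverse weights).
R = {(0, 1): 1.0, (0, 2): 2.0, (1, 2): 3.0}

def r_eff(a, b):
    """Effective resistance between a and b in the triangle:
    the direct edge in parallel with the path via the third node."""
    direct = R[tuple(sorted((a, b)))]
    other = sum(R.values()) - direct
    return direct * other / (direct + other)

V = np.zeros((3, 3))
for a in range(3):
    for b in range(a + 1, 3):
        V[a, b] = V[b, a] = r_eff(a, b)

W = adjusted_weights(V)
assert abs(W[0, 1] - 1 / R[(0, 1)]) < 1e-9     # recovered conductances
assert abs(W[1, 2] - 1 / R[(1, 2)]) < 1e-9
```

In an actual NMA the matrix V would hold the observed variances v_{i,ab} of the multi-arm trial rather than values generated from known resistances; the reconstruction step is identical.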