On modelling of artefact instability in interlaboratory comparisons

Key comparisons are at the core of metrology and support the international equivalence of measurement standards. Typical key comparison setups involve travelling artefacts which are transferred between the participants of a study. After each participating laboratory performs its measurements of the artefacts, a subsequent analysis reveals the degree of equivalence between the participants. For this analysis stage, the stability of the artefacts plays a crucial role, and violations of the stability need to be taken into account to allow for a meaningful comparison. In this work, we present several mathematical models for a treatment of non-negligible artefact instability effects in bilateral comparisons. We highlight the underlying model assumptions and derive analytical formulae for the estimate and standard uncertainty of the instability effect. Moreover, we derive the bilateral degree of equivalence by applying the models in a treatment essentially based on the GUM (JCGM-100). Our considerations conclude with numerical experiments using data from a bilateral comparison on illuminance and from a recent CCM key comparison of kilogram realisations.


Introduction
Key comparisons play a crucial role in metrology and since the establishment of the mutual recognition agreement (MRA) [1] they have formed the foundation for the international equivalence of measurement standards provided by the national metrology institutes and the calibration and measurement certificates they issue. Key comparisons are also important in the realisation and dissemination of units, for example to underpin the new definition of the kilogram. In a typical key comparison, a travelling artefact is measured by the participating laboratories and a subsequent analysis estimates the * Author to whom any correspondence should be addressed.
Original Content from this work may be used under the terms of the Creative Commons Attribution 4.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI. key comparison reference value and determines the degrees of equivalence (DoE) of the participants for the application at hand. This analysis should include a stability assessment regarding the artefact. An artefact is considered stable, if its property under consideration, taken as the quantity of interest in the comparison, remains constant or variations are negligible. A stability assessment, in particular in key comparisons, is important, since an artefact might be damaged during travel or the artefact may naturally change its value, e.g. as a function of time or movement. Artefact instability can lead to an apparent inconsistency between the measurement results that does not stem from any real inconsistency between the laboratories. Robust statistical estimation is one approach to dealing with inconsistent data (see e.g. [2][3][4][5]). Another approach involves modelling the instability, and it is this approach that we will follow here.
In a recent and highly relevant key comparison for the new definition of the kilogram [6][7][8], a star-type comparison between the pilot laboratory and the other participants was employed. Here, each participant measures its individual travelling artefact and then sends it to the pilot, which measures and returns it after examination. The instability of the artefact is estimated by additional measurements performed by the participant and information regarding the uncertainty contribution due to the instability were included in the uncertainty budget. For the kilogram study, each participant is requested to report the resulting instability as a measured mass difference in terms of an estimate and associated uncertainty. For the subsequent analysis, either the absolute value of the estimate or the uncertainty is taken into account, whichever is larger. In another comparison CCM.FF-K4.2.2011 [9], micropipettes were employed to compare calibration procedures for volume measurements. Information about the instability of the pipettes were concluded from three measurements at the pilot laboratory. Then, the standard deviation of these measurements is added to the uncertainty budget. In a comparison of capacitance standards CCEM-K4.2017 [10], some capacitors were removed from the study due to their high travelling instability. The instabilities of the remaining artefacts are assumed to be included in the Type A evaluation of uncertainty considered by the reporting participant. Therefore, no additional correction is considered. In a bilateral comparison study on illuminance [11], the observed drift of a photometer was estimated and included in the uncertainty evaluation. However, explicit formulae are not provided.
This diversity in the treatment of instability effects is naturally imposed by the applications, the type of interlaboratory comparison, and by the properties of the artefacts employed. Nevertheless, we observe a lack of transparency with respect to the mathematical models chosen for the treatment of artefact instability. Rigorous modelling is required to clearly formulate and justify the underlying assumptions. Ad hoc and non-rigorous modelling choices challenge the reliability of the results of the comparison.
Several instability models have already been considered. For circulating artefact comparisons, [12,13] consider linear trend models. In [14], several studies on the stability of gas mixtures are discussed and, in addition to model choices that account for instability, a fixed or random effects model is considered (see also [13,15]). For a discussion on random stability effect models, we refer to [16,17]. Also [18] discusses deterministic and random drift models that introduce time-dependency of the travelling standard. Another important topic is the analysis of statistical power, which is often disregarded in the analysis of interlaboratory comparisons. Statistical power of a hypothesis test is the probability of detecting an actual effect from the present data, such as a significant lack of conformity of a laboratory. When taking an instability effect into account, the statistical power for the detection of a lack of conformity of a laboratory can significantly decrease, leading to almost meaningless conclusions. For a discussion about the decrease of statistical power under instability effects we refer to [16,17]. A consideration of power is, however, out of scope for this work and will be considered elsewhere.
In this work, we focus on bilateral comparisons and formalize a simple but general framework for the artefact instability assessment based on a three-stage measurement procedure. We derive suitable mathematical models for a rigorous treatment of an instability of the artefact. The analytical formulae provided for the estimate and standard uncertainty of the instability can be easily applied in interlaboratory comparisons. In addition, we determine the resulting bilateral DoE, in the presence of artefact instability, based on probability calculus and the law of propagation of uncertainty as advocated by the GUM (JCGM-100) [19]. We like to point out that the references provided above do not generally apply for the case of bilateral comparisons. In particular, the random drift models considered in [13,18] require additional assumptions on the measurement or additional measurements from participating institutes. In a first step of our work, Linear Drift (LD) models [20] are analysed. Then, more flexible instability models are employed that allow the value of the instability at a certain point in time to be realized over a large range according to some probability. We employ these models and illustrate the impact of the model choice on the resulting estimate and uncertainty for the instability. These models can be utilized for the analysis in a key comparison, e.g. to determine the unilateral DoEs. For illustration purposes, we will here consider bilateral DoEs. We exemplify the derived instability models through their application to data taken from a study on illumination [11] and from a recent CCM.M-K8.2021 key comparison carried out to underpin the comparison of different realisations of the redefined kilogram [8].
With this work, we emphasize the importance of rigorous and transparent modeling and we stress the significance of transparent model choices during the analysis of an interlaboratory comparison. To this end, we summarize the key results for the developed models in section 2 and provide easy-touse formulae. For the mathematically and statistically inclined reader, we perform an in-depth derivation for time-dependent LD models in section 3 and for conditional uniform and timedependent triangular models in section 4. In section 5, a discussion of the presented methods is provided and the models are applied to data from real-world comparisons. Finally, section 6 draws conclusions and gives an outlook on potential further research.

Bilateral comparisons involving travelling artefacts
In a bilateral comparison study, the conformity of the two participating laboratories can be assessed by consecutive measurements of a travelling artefact. At time t 1 , laboratory 1 measures the artefact and quotes its measured value 2 y 1 and the corresponding standard uncertainty u 1 . Then, the artefact is 2 We take the measured value yi as the mean of a random variable Yi that describes the state-of-knowledge about the value of the artefact gained in the measurement and, in abuse of notation and in accordance with the GUM, Yi also denotes the value of the artefact at laboratory i for i = 1, 2, 3. u 2 i denotes the variance (squared standard uncertainty) of Yi. transferred to laboratory 2, which performs an independent measurement 3 of said artefact at time t 2 and reports y 2 and u 2 accordingly.

Bilateral DoE without instability model
The difference of both measurements and its expanded uncertainty define the bilateral degree of equivalence (bilateral DoE) between laboratory 1 and laboratory 2. This difference is not expected to be zero, since it comprises all sources of error involved in the measurement of the laboratories, including possible overlooked effects. The consistency of the results, however, has to be judged during the data analysis stage of the bilateral comparison. In this Two Measurements (TM) scenario, the quantity ∆ TM underlying the bilateral DoE and the DoE 4 between both laboratories is given by the simple additive model where we denote the corresponding estimate and standard uncertainty of the bilateral DoE by δ TM and u δTM , respectively. Within this setting, no instability consideration of the artefact is possible since such an effect is indistinguishable from the individual effects of the measurement procedures. In other words, without further assumptions, an instability effect is not identifiable. Therefore, it is common practice (see for example [7,11]) to perform additional stability measurements, e.g. a third measurement Y 3 at time t 3 , after sending the artefact back to laboratory 1. This measurement again yields an estimate and corresponding standard uncertainty y 3 and u 3 . In the following, we tackle the general question of how to incorporate this additional measurement to obtain a meaningful statement regarding the bilateral DoE. We have collected the notation and quantities involved also in table 1. The remainder of this section presents the key results of the paper.

Comparison of instability models
Having a stability measurement as described in the previous section enables a number of different stability assessments. Each requires a mathematical model and is based on certain assumptions. We use the symbol D to denote the instability effect between the first measurement at laboratory 1 and the subsequent measurement at laboratory 2.
The resulting general model for the (quantity underlying the DoE and the) DoE that accounts for an instability effect D is given by where we denote the estimate and standard uncertainty of the instability D by d and u d , respectively. By ρ 1,d we indicate the correlation between the initial measurement at laboratory 1 and the considered instability D. If D is defined independently from Y 1 , then the bilateral DoE is given by where the estimate d and the standard uncertainty u d might be obtained from additional or independent measurements. Based on this model assumption, different instability models and resulting formulae for d and u d will be discussed in the following.
In table 2 we summarize the estimates and standard uncertainties of the instability effect D and the bilateral DoE for different underlying models, which will be derived in the following.

LD model.
One particularly simple instability model for the artefact value is to assume a linear drift between Y 1 and Y 3 . Such an LD model can be applied, for example, to the thermal stability of semiconductors [21]. In this model, the instability can be considered as a value on the straight line between 0 at time t 1 and Y 3 − Y 1 at time t 3 . Assuming in addition that the pilot measurement Y 2 is performed halfway through the comparison study, i.e. t 2 = 1 2 (t 3 + t 1 ), the instability is given by To deduce an estimate and a corresponding standard uncertainty, we employ the GUM framework [19], which derives the estimate d LD from D by inserting the estimates y 1 and y 3 of Y 1 and Y 3 respectively. The corresponding squared standard uncertainty is obtained by the law of propagation of uncertainty Even though the LD model can be seen as a deterministic model, note that the drift D is modeled by a random variable whose distribution is uniquely determined by Y 1 and Y 3 . The resulting DoE under the LD assumption is given by (LD) Linear Drift model. Assume a linear change in the value of the artefact. Value of instability is set to half the difference of measurements at laboratory 1. General, time-dependent expressions are given in section 3 with additional weight towards the middle value. General, time-dependent computations are given in section 4.2.
where ρ 1,3 denotes the correlation between the initial-and the stability measurement performed at laboratory 1. This correlation is natural when, e.g. the same measurement instrument is utilized. In a typical metrological scenario, the correlation fulfills 0 < ρ 1,3 < 1. It can be seen immediately that for u 1 = u 3 the relation u 2 δLD ⩽ u 2 δTM holds, i.e. the additional knowledge gained from the second measurement at laboratory 1 reduces the overall uncertainty of the DoE compared to the TM model (1). Some additional observations are listed in the following: If, in addition u 1 = u 3 holds, the LD assumption does not decrease the standard uncertainty of the DoE compared to the TM model.
• If there is no correlation between Y 1 and Y 3 , i.e. ρ 1,3 = 0 and the uncertainties obtained by laboratory 1 are equal, i.e.
The assumption of a LD as above is equivalent to laboratory 1 reporting the average of its measurement results An in-depth derivation for more general time-dependent LD models is given in section 3.

Uniform probability model.
Assume that the value of the instability of the artefact at time t 2 lies equally likely in This means in particular that D is not larger than the difference of the measurements at labor- A thorough analysis of this model assumption is performed in section 4.1. With this uniform probability (UP) model, one obtains as estimate and squared standard uncertainty for the instability Consequently, the following bilateral DoE can be obtained: It is to note that the standard uncertainty of the DoE depends directly on the squared difference of the measured values of the artefact. If this additional term vanishes, i.e. if y 1 = y 3 , and in addition u 1 = u 3 , then it holds that the UP model yields smaller standard uncertainties than the TM model u δUP ⩽ u δTM with equality if and only if ρ 1,3 = 1.

Triangular probability model.
The previously assumed uniform distribution allows for a random characterization of the instability. However, it might be unlikely that the instability takes every value in the interval [0, Y 3 − Y 1 ] with equal probability. As we have seen in the LD model, it might be advantageous and more likely to assign higher probability to a particular value in this interval. In this paragraph, we model the distribution of the instability as a triangular distribution. The mode of the distribution is hereby chosen equivalent to the LD scenario, namely half the difference between artefact values measured in laboratory . For a general derivation using time-dependent distribution parameters, we refer to section 4.2.
For this Triangular Probability (TP) model of the artefact instability, one obtains as estimate and squared standard uncertainty for the instability.
For the bilateral DoE, one obtains Again, the standard uncertainty of the bilateral DoE depends directly on the observed difference between y 1 and y 3 . Also, the consequences of the previous UP model carry over to the triangular distribution. In fact if y 1 = y 3 and u 1 = u 3 , it holds again that u δTP ⩽ u δTM with equality if and only if ρ 1,3 = 1. Moreover, it can be seen that the standard uncertainty of the DoE is reduced over the standard uncertainty of the DoE for the UP model, i.e u δTP ⩽ u δUP and the impact of the observed difference y 3 − y 1 is diminished.

Discussion on synthetic data.
In figure 1 we show the impact of the different artefact instability models on synthetic data. The following values (with arbitrary units) are considered and shown in figure 1(a): In figure 1(b) it is clearly visible that the TM model is independent of the correlation ρ 1,3 , since no stability measurement Y 3 is performed. Also, it can be seen that the LD model (blue line) yields the smallest uncertainty for the DoE. In the following sections we will show that this holds true for all models and settings considered. For the UP model (cyan line) and TP model (red line), the resulting standard uncertainty for the bilateral DoE is larger and, for some values of ρ 1,3 close to 1.0, even exceeds the standard uncertainty of the TM model.
The UP model leads to larger uncertainties than the TP model because of the additive dependence on the squared difference of the measured values (y 3 − y 1 ) 2 , with the triangular model having a smaller prefactor. The histogram plots at the bottom show the distributions of the instability D for the different choices of ρ 1,3 ∈ {0.1, 0.5, 0.95} and different instability models. Each histogram is obtained from Monte Carlo sampling of the instability D assuming Gaussian distributions for Y 1 and Y 3 . The choice of Gaussianity is an assumption made here and it is based on the availability of only an estimate and corresponding standard uncertainty, cf also Supplement 1 to the GUM, JCGM:101 [22]. For the uniform and triangular model, conditional sampling is applied according to the models defined in section 4. The inset in each plot shows the shape of the joint distribution of Y 1 and Y 3 dependent on the correlation ρ 1,3 . Based on the joint Gaussian distribution of the inputs Y 1 and Y 3 , the distributions of the unconditional instability D for the uniform and triangular model are obtained as follows. First, draw i = 1, . . . , N dependent realizations (y i 1 , y i 3 ) from the joint Gaussian distribution of (Y 1 , Y 3 ). Then, draw a sample d i from the corresponding conditional distribution for D, given Y 1 = y i 1 and Y 3 = y i 3 . The histograms in figure 1 show the resulting samples d i for the respective models. For all scenarios, the TM model assigns no instability, hence the value is set to 0 with probability 1. For large values of ρ 1,3 , the LD model assigns high probability to the value 1 2 (y 3 − y 1 ) = −0.15, whereas the TP model and the UP model have a wider spread (also their distributions may not be well approximated by a Gaussian).
The source code and a web application that generates the plots above can be found here [23].

GLD model
In this section we assume a LD between Y 1 and Y 3 and generalize the considerations above to arbitrary times t 1 , t 2 and t 3 . We abbreviate this generalized linear drift model by GLD. The instability is considered as the value on a straight line at time t 2 . In particular we set Assuming this scenario, a straightforward application of the linear propagation of uncertainties formulae [19] yields the following estimate and squared standard uncertainty for the bilateral DoE Some remarks regarding the time-dependent LD model: • If u 1 = u 3 and we have perfect correlation ρ 1,3 = 1, then the uncertainty obtained by the LD model is consistent with that of the TM model, i.e. u 2 δGLD = u 2 1 + u 2 2 = u 2 δTM . Note, however, that the estimate is generally different.
Hence, for all times t 2 , the resulting uncertainty u δGLD is smaller than the uncertainty of the TM model u δTM if the correlation between the initial and the stability measurement is not perfect. • If u 1 = u 3 and ρ 1,3 < 1, the choice t 2 exactly half way between t 3 and t 1 minimizes the standard uncertainty u δGLD over time. This means that the LD model with t 2 = 1 2 (t 3 + t 1 ) has the smallest standard uncertainty in the class of timedependent LD models considered, i.e. u δLD ⩽ u δGLD .

Probabilistic models with conditional random instability effect
In this section we discuss two models that incorporate an additional randomness which allows more flexibility in the modeling of artefact stability. Here, the artefact instability D of model (2) is equipped with a probability distribution, conditioned on the measurements at laboratory 1. Since we will focus on identifiable models, a treatment of D as a random process (e.g. a random walk) is not possible, but the following uniform interval model in section 4.1 and the triangular model in section 4.2 do cover many cases of metrological relevance.

UP model
The UP model makes the assumption that the instability takes values in the interval [0, Y 3 − Y 1 ] with equal probability, while disregarding values outside of the interval. This assumption defines the following conditional distribution of the artefact instability where U denotes the continuous uniform (or rectangular) distribution. The condition on Y 3 − Y 1 can in this model be replaced by conditioning on the joint information generated by Y 1 and Y 3 . This includes the correlation coefficient ρ 1,3 , which is omitted in the notation of the conditional random variables, such as (17). Note that if Y 1 > Y 3 , we consider instead the interval [Y 3 − Y 1 , 0], which effectively changes the direction of integration but does not change the formulae for the estimate and standard uncertainty. The unconditional expectation of D can be obtained by the conditional expectation property [24], where one uses the fact that the conditional expectation of a random variable, given another random variable, is again, a random variable For the variance of D we require the second moment of D, which exists by construction, and we obtain In addition, we need to compute the covariance of Y 1 and the drift D. For this purpose, we observe that At this point, we have characterized the moments of the distribution of the instability D up to the second order. Now, we define the estimate for D as the expectation and the squared standard uncertainty of D as the variance, in accordance with JCGM:101 [22]. Hence, using the estimates y i and squared standard uncertainty u 2 i of Y i (i = 1, 3) as mean and variance, we obtain for the estimate d and standard uncertainty u d d = 1 2 (y 3 − y 1 ), . (22) This, together with the additive instability model definition in equation (2), results in the following bilateral DoE Due to the dependence on the actual measurement results y 1 and y 3 , it might be of interest to analyse the expected uncertainty under repeated realisations of Y 1 and Y 3 . Assuming that Y 1 and Y 3 have the same mean value, one obtains Then, the expected standard uncertainty of the bilateral DoE satisfies the following properties (assuming u 1 = u 3 ): • The expected standard uncertainty is consistent with the TM model under perfectly correlated measurements at laboratory 1, i.e. if ρ 1,3 = 1, it holds • Assuming 0 ⩽ ρ 1,3 < 1, the standard uncertainty is reduced compared to the TM model. In particular, a gain in knowledge is expected from the stability measurement in terms of

Time-dependent TP model
The previously assumed uniform distribution allows for a random characterization of the instability. However, it might be unlikely that the instability takes every value in the interval [0, Y 3 − Y 1 ] with equal probability. As we have seen in the LD model of section 3, it might be appropriate and more likely to assign higher probability to particular values in this interval.
In this section, we model the instability effect as where the parameter of the triangular distribution denote the left interval boundary, the mode, and the right interval boundary respectively. This model again assumes that the value of the instability is always smaller than Y 3 − Y 1 . However, values are more probable when they are closer to values predicted by the LD model. For this model, computations utilize the same techniques as in the uniform distribution case of section 4.1. Also, as in the previous section, the estimate and standard uncertainty of the drift term D are given by the expectation and the variance of the unconditional random variable D, here given by d = c 1 3 (y 3 − y 1 ) , and . Hence, if the additive instability model assumption (2) applies, one obtains for the bilateral degree of equivalence Considering again the expected standard uncertainty, averaging with respect to Y 1 and Y 3 is straightforward and yields As in the previous UP model, one can observe: • Equivalence with the TM model under perfect correlation, i.e. for u 1 = u 3 and ρ 1,3 = 1 it holds • Assuming u 1 = u 3 and 0 ⩽ ρ 1,3 < 1 it holds from which we can directly conclude that the TP model yields smaller expected standard uncertainties than the UP model, i.e. Eu δTP < Eu δUP for every scenario t 1 < t 2 < t 3 .

Discussion and numerical evaluation
The artefact instability models introduced are derived under different assumptions and it is left to the experts of the individual application to judge the suitability of the assumptions underlying each model. In this section, we show two examples from literature that carry out similar comparison strategies as explained in table 1 and we analyse the impact that the choice of the model has on the stability assessment.

Bilateral comparison on illuminance
We consider the bilateral comparison on illuminance carried out between IPT and LABELO [11]. The goal of the bilateral comparison was to approve the measurement equivalence between the two laboratories and to confirm their photometer calibration capability.
In this bilateral comparison LABELO acted as the pilot laboratory. A photometer was taken as the travelling standard and the calibration results at LABELO and IPT were taken to assess their equivalence. A calibration measurement was carried out first at LABELO before the photometer was transferred to IPT. After the calibration measurements had been conducted at IPT, the photometer was returned to LABELO and calibrated once again in order to estimate potential instability effects.
The calibrations of the photometer were considered at six calibration points taken as [20,110,400,1000,1600,2000] lx. We will concentrate on the results obtained at 1600 lx. The data in the framework of section 2 were derived from the figures given in [11, tables 1-3] as follows: The report [11] states that the drift is already included in the uncertainty of LABELO. Since the observed drift for all calibration points is small compared to the uncertainty of the single measurements, and we expect that these values had been added quadratically, the above chosen standard uncertainties of LABELO seem adequate. The correlation between the two measurements taken at LABELO is not reported in [11], but, following the previous argument, a high correlation is to be expected. Therefore, we carry out our analysis in dependence on an assumed ρ 1,3 ∈ (0, 1). Another possible strategy to estimate unknown correlation coefficients is given in [25]. Figure 2 shows the resulting estimate of the instability and associated expanded uncertainty for the instability models from section 2, using a coverage factor k = 2. Calculations were carried out following the formulae in table 2, assuming that the measurements at IPT were conducted at a time approximately in the centre of the interval between the two measurements at LABELO. For every value of ρ 1,3 , the LD model and the TP model yield similar uncertainties for the instability, while the distribution of the instability, depicted for ρ 1,3 = 0.1, 0.5, 0.95 at the bottom of figure 2, are quite different. The UP model produces the largest standard uncertainty. It is to note that the distributions of the instability D are not Gaussian for the UP and the TP model. However, the resulting estimates are the same. It can also be seen that the standard uncertainty decreases for increasing correlation coefficient ρ 1,3 . This can be explained for the UP model by the derivation in (26) and for the TP model by (34). For another discussion on such issues, see also [26]. Figure 3 shows the resulting bilateral DoE in terms of standard uncertainties (left) and of the E n values [1,27] (right). In our bilateral comparison scenario, this value reads where M denotes the abbreviation of the employed model, i.e. M ∈ {TM, LD, UP, TP}, and k denotes the coverage factor to translate u δM to an expanded uncertainty. We point out that we choose k = 2 although this cannot generally be advised, since the distribution of the DoE is not necessarily Gaussian for the UP model and the TP model (cf figure 1). On the other hand, if the instability effect D dominates the distribution of the DoE, the comparison itself might not be meaningful [16].  As can be seen in the left of figure 3, the standard uncertainty of the bilateral DoE depends on the choice of the instability model. The UP model yields larger standard uncertainties than the TP model and the GLD model covers a wide range of uncertainties when the time for the second measurement t 2 is varied. For highly correlated cases, i.e. ρ 1,3 > 0.8, the models proposed can yield standard uncertainties for the bilateral DoE that exceed the TM model result. This can be explained by the slightly larger standard uncertainty of the stability measurement u 3 > u 1 . The impact of the instability model choice is visible on the right of figure 3. The E n value reported in [11] is 0.45, which corresponds to the baseline TM model. The wide range of possible E n values for the LD model with t 1 ⩽ t 2 ⩽ t 3 is explained by the observed difference Table 3. Resulting standard uncertainties for the instability D according to [8,  in data. In particular, for t 2 close to t 1 , the GLD model converges to the TM model between y 1 and y 2 , resulting in an E n value corresponding to the TM model. On the other hand, for t 2 close to t 3 , the model collapses to a TM model between y 2 and y 3 , which has a larger absolute difference, resulting in E n values larger than 0.5, also for the highly correlated case. This observation challenges the applicability of the GLD model for values t 2 far away from half the difference of t 3 and t 1 . Nevertheless, the framework presented allows different choices of instability models to be explored.

Instability analysis for a comparison of kilogram realisations
Another important comparison from literature that adheres to the strategy in table 1 is the key comparison data from a recent study on kilogram realisations [8]. Due to the complexity of the comparison study and the new definition of the kilogram [28], we focus here solely on the analysis of the mass stability. In this comparison, each participating laboratory measures its travelling standard against a stable reference mass. Then, the kilogram realisation is sent to the pilot (BIPM) and after the pilot measurement it is returned to the participant for another measurement. From the two measurements of the participant, a correction and its corresponding standard uncertainty is estimated. The correction is taken as half the difference of the measurements performed at the participating laboratory. In [8, p 15], it is stated that the standard uncertainty of the correction was derived from two components. 'The first is the uncertainty of the determination of the mass change made by the participant' and 'the second component is related to the choice of the value chosen for the correction'. It is assumed 'that the mass while at the BIPM lies between the NMI stability measurements' and for this 'a rectangular probability distribution, centred at half of the observed change (used for the correction), and with width of half the observed change ...' is considered. Both components are added quadratically. In formulae, this readsd where the involved quantities were (in most cases) obtained independently from those taken for the key comparison. The apparent similarity to our UP model is natural due to the assumption of the rectangular distribution for the instability. However, the prefactor 1/3 from the approach presented here in section 4.1 is based on conditional probability calculus, whereas the formulae (39) adds the standard uncertainty of the observed difference to the standard uncertainty obtained from a rectangular distributed quantity. In the following, we replicate table 4 of [8] and include the standard uncertainties of the correction/instability, according to the models presented. The values were derived from table 4 of [8] as follows: column three, indicating the mass change, corresponds to y 3 − y 1 ; column four presents the standard uncertainty of the mass change, i.e. √ u 2 3 + u 2 1 − 2ρ 1,3 u 1 u 3 . Finally, column six showsũ, which can be compared to the standard uncertainty of the models of our table 2. Table 3 compares the different models proposed with the results reported in [8]. It can be seen that the reported standard uncertainty of the instability for every travelling artefact is larger than those estimated from our models. This is due to the different prefactors for the standard uncertainty of the mass change. In particular, the LD model halves the standard uncertainty of the mass change, neglecting the observed mass difference, whereasũ from (39) merely adds the (often negligible) observed mass difference times 1/12. As also derived in theory, it can be seen that (LD) < (TP) < (UP) in terms of resulting standard uncertainty of the instability.
We point out that the comparison above does not intend to criticize the proceeding in [8], where the relevant contributions to the uncertainty of the instability are identified and included in the analysis. The values presented in table 3 are based on different model assumptions and it is left to the experts to validate those modelling assumptions. However, the alternative approaches presented in this work can be used directly in the evaluation of similar comparisons. For the comparison on kilogram realisations, the instability models presented lead to smaller standard uncertainties for the artefact instability. Unilateral DoEs can be obtained directly by including the resulting instability estimate and standard uncertainty in the evaluation. This is, however, beyond the scope of this work, which focuses on bilateral comparisons.

Conclusion and outlook
We have presented several mathematical models to analyse a bilateral laboratory comparison that is affected by possible artefact instability. Table 2 summarizes easy-to-use formulae that can be applied immediately. All models presented for the bilateral DoE that comprise a stability measurement yield similar estimates. The choice of the model does, however, influence the standard uncertainty of the bilateral DoE. For highly correlated measurements at laboratory 1, the resulting standard uncertainties for the DoE can, depending on the model chosen, exceed those of the (naive) TM model. Therefore, during the data analysis stage of an interlaboratory comparison study, it is important to carefully consider the assumptions underlying possible choices for an artefact instability model. As demonstrated in this work, the choice of the model can affect the interlaboratory consistency as evidenced by the degree of equivalence. Finally, we note that the formulae for the estimate and standard uncertainty of the instability effect can be directly incorporated into any comparison involving artefact instability to, e.g. estimate unilateral DoEs.
The proposed list of models is not exhaustive and other classes of models can be considered in this regard. With the introduction of time-dependency in the LD model in section 3 and the time-dependent triangular model in section 4.2, it might be natural to consider the instability of the artefact as a random process and model the instability, e.g. as a random walk [29] or Gaussian process [30]. However, having only one stability measurement, in addition to the comparison measurements, renders those choices unidentifiable and additional assumptions are required. Following the same argument, we do not consider the important topic of model evaluation (see for example [31]) and consider all expressions derived above as conditional on the model. Also, having additional stability measurements or additional knowledge of the artefact structure would allow the justification of certain model choices. This is, however, beyond the scope of this paper.
Another important property to analyse in future work is the statistical power. Having a reasonable power estimate for the resulting uncertainties represents a crucial step towards possible guidance in interlaboratory comparisons. This will be a topic of future research.