Novel approach to reducing uncertainty of drift estimate in international comparisons of relative humidity scales

International comparisons of relative humidity scales at the highest level are often challenging because of significant drift of the impedance-based transfer standards, which reduces the accuracy of the comparison. Comparisons that are shorter or have fewer participants may be less affected by drift, but they ultimately require more inter-linking, with its inherent irreproducibility, to compare all scales internationally. We addressed these issues in a large comparison of 21 laboratories from the European association of national metrology institutes (EURAMET), covering a range from 10%rh to 95%rh at air temperatures from −10 °C to 70 °C. In it, nine state-of-the-art transfer standards drifted on average between 0,6%rh and 1,8%rh, depending on the measurement point, while the participants could achieve calibration uncertainties on average between 0,2%rh and 1,5%rh. To produce useful comparison data even with significantly drifting transfer standards, we carefully planned the topology of the comparison and the number of transfer standards so as to be able to partly compensate for drift. This paper focuses on the approach we took in analysing the results to minimise the impact of instrument drift on the comparison outcome. We combined the estimated drift of the transfer standards together with the links drawn between the loops into a so-called between-set variability. To calculate the reference value, we used an adjusted Paule-Mandel estimate in which we modified the model equations to account for the three interlinked loops and solved them using non-linear least-squares fitting. This approach reduced the uncertainty in the drift estimates and the uncertainty in the evaluated degrees of equivalence at most, but not all, of the measurement points. We discuss the advantages of this approach with a view to its applicability in future interlaboratory comparisons.

Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

Introduction
Inter-laboratory comparisons require consideration of the key measurement properties of transfer standards, especially drift. In contrast to dew point comparisons [1][2][3], drift in relative humidity (RH) comparisons can be at the same level as, or even exceed, the uncertainties of the compared results, which poses a challenge throughout the comparison process. RH comparisons are therefore often carried out with a smaller number of participants [4][5][6][7][8], so that the impact of drift on the final comparison result is usually smaller. However, this fragmentation requires more replicate measurements by the laboratories that link the segments, which in turn increases irreproducibility and reduces the comparability of the overall RH scales. An alternative approach is to include a larger number of laboratories and redesign the comparison topology to shorten the circulation of transfer standards and reduce the impact of drift [9][10][11].
This paper discusses a large RH intercomparison involving 21 national or designated metrology institutes (NMI/DI) from the European association of national metrology institutes (EURAMET). The aim was to compare calibration results of impedance-based RH hygrometers in the range of 10%rh to 95%rh at temperatures from −10 °C to 70 °C. The comparison was divided into three parallel (sub)loops to minimise the effect of drift, with three state-of-the-art transfer standards circulating in parallel in each loop, nine in total. The measurements took about three years, followed by a search for statistical tools to determine the reference value.
To calculate the comparison reference, we first tried a conventional weighted-mean approach. However, this method proved inadequate, as it failed a χ²-test at all measurement points due to significant variability of the results (see section 6). Recognising that this variability, referred to as between-laboratory variability, may be attributed to drift of the transfer standards, we explored alternative approaches for calculating the reference value. One potential solution was to identify the largest consistent subset of outcomes, with acceptably low χ² [12, 13]. Another approach involved identifying the largest coherent subset that exhibited pairwise equivalence, based on bilateral degrees of equivalence at a desired significance level [13, 14]. Achieving consistency on subsets was also investigated by Bergoglio et al [15] and Bodnar et al [16-18], who also included polynomial interpolation of the drift trend. Since no clear outliers were identified despite the high χ² values, we employed an estimator proposed by Paule and Mandel [19]. This method is recommended for resolving significant inconsistencies in the consensus model resulting from uncorrelated between-laboratory variability [13, 20, 21]. Rukhin and Vangel [22] further extended the model and demonstrated that the Paule-Mandel method can be interpreted as a simplified version of the maximum likelihood method. Similarly, a method based on the criterion of zero sum of normalised residuals [23] was developed that eliminates the need for a normality assumption. Among the many methods available, no scientific or practical consensus has been reached on an optimal or superior approach [23]. However, in our investigation, we found that the Paule-Mandel method best fits our comparison topology with three interlinked loops and the presence of significant drift.
The article presents a new approach in the field of thermal quantities to give more confidence to the reference value in the presence of significant drift of RH transfer standards. The article does not elaborate on the comparison results, as we presented these already in a separate EURAMET report [24]; instead, it focuses on the method used to increase confidence in the reference value.

Comparison topology and transfer standards
To shorten its duration, the comparison was divided into three parallel loops, in each of which three impedance hygrometers circulated as transfer standards. The participating NMI/DIs calibrated the three instruments simultaneously (as a triplet). The first measurements of the nine transfer standards were performed by the two loop pilots, UL-FE/LMK (SI), which piloted loops 1 and 3, and CETIAT (FR), which piloted loop 2 (see figure 1). These initial measurements served to establish the link between the different loops. Long-term stability was monitored by an additional calibration of the triplets in the middle of each loop and, after completion of all loops, by both pilots.
The participants had 2-4 weeks to complete the measurements and send the transfer standards to the next laboratory. The participants were divided into specific groups for smooth transport and administration. Each participant had to calibrate the transfer standards at 30 agreed consecutive nominal measurement points between 10%rh and 95%rh at air temperatures of −10 °C, 23 °C and 70 °C, depending on their capabilities.
In an attempt to minimise drift, the most advanced impedance-based hygrometers were selected: the Vaisala HMT337 and the Rotronic HP22-A/HC2-S. Before being sent to the next participating laboratory in each loop, all sensors were interchangeably calibrated between the two piloting laboratories and then combined into loop triplets consisting of two Rotronic sensors and one Vaisala sensor.

Measurement results from the participants
Impedance-based hygrometers may provide poorly reproducible measurement results due to causes such as drift, nonlinearity, temperature dependence and hysteresis. However, most of these factors can be controlled at the level of the individual participant to minimise their impact on the comparability of results. We thus set 30 measurement points, p, with different air temperatures t and RHs φ for the calibration of the transfer standards in the participating laboratories. The three latter sources were each accounted for by measuring at least three levels of RH (non-linearity), at different air temperatures (temperature dependence) and in a prescribed rising and descending order (hysteresis). The 30 measurement points are given in table 1. The effect of airflow on sensor response was also tested (note the asterisked points with halved airflow in table 1) [25], but proved insignificant.
For each of the three transfer standards, denoted by s, and for each nominal point p listed in table 1, the participants reported a measurement result obtained at a specific time. The results were assigned a consecutive number (index i) to represent their chronological position within the corresponding loop sequence. A measurement result E(i,s,p) represents the difference between a participant's reference value φ_ref(i,s,p) and the average value of the transfer standard's indications, φ_ind(i,s,p). Each result has an assigned standard uncertainty u(i,s,p) and expanded uncertainty U(i,s,p), assuming a 95% confidence level.
We collected over 2700 results from 21 participants, including all pilot measurements and accounting for points omitted due to participants' limited capabilities. We then processed this large pool of data to obtain coherent sample sets by tracking down and recalculating results formulated according to the World Meteorological Organization (WMO) definition of RH at freezing temperatures [26, 27]. To make them more manageable, we harmonised the uncertainty contributions of the transfer standards and combined the results obtained at the same nominal measurement points. The participants' results E(i,s,p) at the same repeated nominal measurement points for temperature and RH, p_h ∈ p from table 1, were combined into a new result E(i,s,h) at point h from table 2.
A combined result was calculated as the arithmetic mean according to equation (1),

E(i,s,h) = (1/N_h) Σ E(i,s,p_h),   (1)

where N_h represents the number of nominal points p that were combined into the corresponding new measurement point h (e.g. at 23 °C and 95%rh three points were combined, i.e. N_5 = 3; see table 1). The uncertainty of a combined result, u(i,s,h), was calculated from the individually reported uncertainties of the results u(i,s,p_h) according to equation (2). The first component represents the mean reported variance. The second part captures the above-mentioned sources of irreproducibility of the transfer standard, such as the nonlinearity and the hysteresis (but not drift; this follows below), assuming a uniform probability distribution.
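As an illustrative sketch of this combination step (equations (1) and (2)), the following function averages the results at the repeated nominal points; the reproducibility term `a_repro` is a hypothetical uniform-distribution half-width, since the exact second component of equation (2) is comparison-specific:

```python
import numpy as np

def combine_point(E, u, a_repro=0.0):
    """Combine results E(i,s,p) reported at the same nominal point
    into one result E(i,s,h) with uncertainty u(i,s,h).
    E, u     : the N_h individual results and their standard
               uncertainties at the repeated nominal points p_h.
    a_repro  : assumed half-width of a uniform distribution covering
               transfer-standard reproducibility (nonlinearity,
               hysteresis); value is comparison-specific."""
    E = np.asarray(E, float)
    u = np.asarray(u, float)
    E_h = E.mean()                                # arithmetic mean, eq. (1)
    # mean reported variance plus a uniform reproducibility term
    u_h = np.sqrt(np.mean(u**2) + a_repro**2 / 3.0)
    return E_h, u_h
```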
From this point on, we compare these combined pairs of results, E(i,s,h) and u(i,s,h). Since we performed the same analysis for each point h, we omit h in the rest of the presentation for a more concise form: E(i,s) and u(i,s).

Drift of the transfer standards
To estimate the drift, each of the nine transfer standards was calibrated by both pilot laboratories at the beginning and end of the circulation. The pilots also performed an intermediate calibration in their respective loops (see the comparison scheme in figure 1). Their results were then separately piecewise interpolated using the linear functions E_P1(i,s) and E_P2(i,s), respectively, as shown in figure 2. We chose piecewise interpolation over a simple linear fit in cases with an intermediate pilot measurement because we assumed it better accounts for non-linearity, discontinuity and changes in variability (heteroscedasticity) of the drift.
Note that in figure 2 the x-axis shows consecutive measurement instances and not the time elapsed since the instrument began to circulate. This is because we assume that drift depends not only on time, but also on the temperature and humidity cycling, the time the instrument has been exposed to extreme conditions, transport, handling, etc. These conditions were therefore carefully planned in the comparison protocol to minimise their effects and keep them as similar as possible. For this reason, the consecutive instance is arguably the more relevant and straightforward measure for the drift.
For each standard s and each measurement point h, we then calculated the arithmetic mean, E_P(i,s), of the interpolated functions of the two pilots to increase confidence. From this, we estimated the systematic component of drift, E_drift(i,s), from the onset of circulation (at instance 1) according to equation (3).
For each h:

E_drift(i,s) = E_P(i,s) − E_P(1,s) + δ_drift,   (3)

where δ_drift is the random error of this model. We estimated it as zero, but the uncertainty of this estimate, the uncertainty of the drift u_drift, remains. It is calculated later, along with the comparison reference value, in equation (15) of section 6.
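A minimal sketch of this drift estimation in Python (the function name is illustrative; it performs the piecewise-linear interpolation of each pilot's results over consecutive measurement instances and takes the drift relative to instance 1, as described above):

```python
import numpy as np

def drift_estimate(i_pilot1, E_pilot1, i_pilot2, E_pilot2, i_query):
    """Estimate the systematic drift component E_drift(i,s) from the
    two pilots' calibrations (a sketch of equation (3)).
    i_pilotX, E_pilotX : measurement instances and results of pilot X
                         (start, optional intermediate, end).
    i_query            : instances at which the drift is evaluated."""
    # piecewise-linear interpolation over consecutive instances
    E_P1 = np.interp(i_query, i_pilot1, E_pilot1)
    E_P2 = np.interp(i_query, i_pilot2, E_pilot2)
    E_P = 0.5 * (E_P1 + E_P2)                 # mean of the two pilots
    E_P_start = 0.5 * (E_pilot1[0] + E_pilot2[0])
    return E_P - E_P_start                    # drift since instance 1
```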
A detailed analysis of the drift can be found in the comparison report [24]. To illustrate its significance, the maximum E_drift for any of the transfer standards was found to be 3,7%rh, and on average it was between 0,6%rh and 1,8%rh, depending on the measurement point.

Linking the results obtained with different transfer standards
The results obtained with different transfer standards cannot be directly compared. However, we can compare them relatively, that is, relative to the pilots' initial measurements, E_P(1,s), provided we also consider the drift that occurred afterwards. Therefore, we normalise the results with the pilots' measurements: with the initial measurements at instance 1, E_P(1,s), and with the estimated drift that occurred afterwards, E_drift(i,s) (see equation (4)). If we then wish to compare any pair of measurements, say i and j, obtained with their respective transfer standards s_i and s_j, the bilateral equivalence (difference) D_ij between them is expressed in equation (5) [1]. The correction δ_L represents an error in the estimation of this link. Its expected value is zero, while its uncertainty is estimated according to equation (6) below. Assuming high correlation between repeated measurements of the same pilot, the main sources of uncertainty are the drift and the inconsistency of measurements between the two pilots. Furthermore, in our final analysis below, we also separate the uncertainties of the drift and of the links, so that the uncertainty of the links is limited to the variability of the link routes through the initial pilot measurements.
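As a minimal sketch of this normalisation and linking step (equations (4) and (5)), assuming the sign convention that the pilot baseline and the estimated drift are subtracted from each result (the exact form of equation (4) is not reproduced here, so this convention is an assumption):

```python
def normalise(E, E_P1, E_drift):
    """Normalise a result to the pilots' initial measurement E_P(1,s)
    and the drift E_drift(i,s) accumulated since instance 1
    (assumed form of equation (4))."""
    return E - E_P1 - E_drift

def bilateral_equivalence(E_i, E_j, E_P1_i, E_P1_j, drift_i, drift_j):
    """Bilateral degree of equivalence D_ij between two normalised
    results obtained with transfer standards s_i and s_j (assumed
    form of equation (5), omitting the link error delta_L, whose
    expected value is zero)."""
    return normalise(E_i, E_P1_i, drift_i) - normalise(E_j, E_P1_j, drift_j)
```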

The comparison reference value
As there was no transfer standard measured by all participants, no absolute reference value could be determined. Instead, the difference (degree of equivalence), D, between the comparison reference value, E_RV, and the individual laboratory result was calculated. By linking three different transfer standards in three different loops, we obtain 27 (i.e. 3³) different sets of results per measurement point h for comparison, a total of 297 sets over the 11 measurement points. For example, a combination c ∈ {1, 5, 7} means that we combine the results from the first loop obtained by calibrating sensor c_1 = 1, the results from loop 2 obtained with sensor c_2 = 5 (the second of the three sensors numbered 4 to 6), and the results from loop 3 obtained with sensor c_3 = 7.

Analysis with a common weighted mean
For each of these 27 sets, the E_RV and the individual degrees of equivalence D(i,c) were first calculated using the weighted mean of the normalised (linked) results. The subsequent χ²-test failed to confirm mutually consistent results for all 297 sets. Table 3 shows the minimum and maximum values of the Birge ratio σ_B [16, 28], calculated according to equation (7),

σ_B = √( Σ_i w_i (E(i,c) − E_RV)² / (N_h,c − 1) ),   (7)

where N_h,c denotes the number of results in a given set c at measurement point h. The χ²-test fails when σ_B is larger than 1. Table 3 shows that even the sets with the minimum value of σ_B exhibit medium to large over-dispersion (σ_B ≫ 1).
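A minimal sketch of this weighted-mean consistency check for one set of linked results; the Birge-ratio form used here is the standard one (root of the reduced chi-square) and is assumed to match equation (7):

```python
import numpy as np

def weighted_mean_birge(E, u):
    """Weighted mean E_RV, its standard uncertainty, and the Birge
    ratio sigma_B for one set of linked results."""
    E = np.asarray(E, float)
    w = 1.0 / np.asarray(u, float)**2        # inverse-variance weights
    E_rv = np.sum(w * E) / np.sum(w)         # weighted mean
    u_rv = 1.0 / np.sqrt(np.sum(w))
    N = len(E)
    # Birge ratio: sqrt of observed chi-square over degrees of freedom
    sigma_B = np.sqrt(np.sum(w * (E - E_rv)**2) / (N - 1))
    return E_rv, u_rv, sigma_B
```

Values of sigma_B well above 1 flag the over-dispersion that motivated the Paule-Mandel treatment below.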

Alternative Paule-Mandel estimator for a subset of results obtained with the same transfer standard
The main reason for the over-dispersion was assumed to be the excessive between-laboratory variance, with its two components:
• the drift of the transfer standards,
• the inconsistency of the pilot results used for loop linkage.
In order to isolate these additional variances from the comparison results themselves, we extended the common weighted mean with a Paule-Mandel model [12, 13]. In it, we treated our additional uncertainty components as between-set variances and assumed that they are the main cause of the inconsistencies.
First, we consider the first component, the drift, which affects a subset of the results within a given loop. We modify the weights to account for the uncertainty of the drift estimate for a transfer standard s according to the following equation:

w_i = 1 / (u²(i,s) + u²_drift).   (8)

To obtain the Paule-Mandel equation, we first rewrite equation (8) in the following form:

Var[E(i,s)] = 1/w_i = u²(i,s) + u²_drift,   (9)

where Var is an abbreviation for variance. We then reformulate equation (9) to equation (10),

Var[√w_i · E(i,s)] = 1,   (10)

and expand equation (10) to represent the variance of the results around the mean, the reference value E_RV:

1/(N_h,s − 1) · Σ_i w_i (E(i,s) − E_RV)² = 1,   (11)

E_RV = Σ_i w_i E(i,s) / Σ_i w_i,   (12)

where N_h,s represents the number of results included in this subset with transfer standard s at measurement point h. The resulting non-linear equation (11) contains the variable u_drift implicitly, through w_i (see equation (8)) and E_RV (see equation (12)).
Finally, we obtain the Paule-Mandel non-linear equation, for which we can find a solution u_drift by iteration:

F(u²_drift) = Σ_i w_i (E(i,s) − E_RV)² − (N_h,s − 1) = 0.   (13)

At each iteration step we first change u_drift by a step ∆u_drift and then calculate the new w_i, E_RV and F(u²_drift). This is repeated until F(u²_drift) is satisfactorily close to zero. At that point, the last values of u_drift and E_RV represent the solution.
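The iteration described above can be sketched as a one-dimensional root search. Here bisection replaces the stepwise ∆u_drift update, exploiting the fact that F(u²_drift) decreases monotonically in u²_drift; this is a sketch, not the authors' exact implementation:

```python
import numpy as np

def paule_mandel(E, u):
    """Solve the Paule-Mandel equation F(u_drift^2) = 0 by bisection
    for one subset of results obtained with the same transfer
    standard; returns (u_drift, E_RV)."""
    E = np.asarray(E, float)
    u = np.asarray(u, float)
    N = len(E)

    def F(t2):
        w = 1.0 / (u**2 + t2)                      # weights, eq. (8)
        mu = np.sum(w * E) / np.sum(w)             # E_RV, eq. (12)
        return np.sum(w * (E - mu)**2) - (N - 1)   # eq. (13)

    if F(0.0) <= 0.0:
        t2 = 0.0                     # results already consistent
    else:
        lo, hi = 0.0, 1.0
        while F(hi) > 0.0:           # bracket the root
            hi *= 2.0
        for _ in range(200):         # F is decreasing in t2
            mid = 0.5 * (lo + hi)
            if F(mid) > 0.0:
                lo = mid
            else:
                hi = mid
        t2 = 0.5 * (lo + hi)
    w = 1.0 / (u**2 + t2)
    return np.sqrt(t2), np.sum(w * E) / np.sum(w)
```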

Paule-Mandel estimator for a complete set of results linked from different loops
The next step is to adapt the model given in equation (13) to include the subsets of results that are linked from different loops. To do this, we first extend the weights from equation (8) to include the uncertainty due to the links (equation (14)). In equation (14), w_{i,x} represents the weight of the result of a given participant x when it is linked to another result of participant i, both of which are obtained with the corresponding transfer standard from a given set c ∈ {c_1, c_2, c_3}; loop(i) and loop(x) represent the loop numbers (1-3) of participants i and x, respectively.
Furthermore, for each loop/subset we formulate a separate Paule-Mandel equation with the corresponding transfer standard from a sensor combination c ∈ {c_1, c_2, c_3}. The functions use the normalised (linked) results and depend on each other through the shared comparison reference value E_RV. Thus, for each participant x at measurement point h, we construct the non-linear equations (15)-(19), where N_h,c1, N_h,c2 and N_h,c3 represent the numbers of results reported by the participants for each transfer standard/loop at measurement point h, respectively.
The three equations represent an interdependent non-linear problem with bound constraints that cannot be solved in the same way as in [19]. After considering several numerical approaches, we finally settled on an algorithm that provided a solution in all cases: a non-linear least-squares solver that uses a trust-region-reflective algorithm [29]. To implement it, we used the Matlab® function lsqnonlin with the lower and upper bound parameters for u_drift set to 0 and 10, respectively. We also tried the alternative interior-point algorithm [30], which produced practically the same results; the resulting u_drift and E_RV deviated by less than 0,01%rh.
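An open-source analogue of this setup can be sketched with scipy.optimize.least_squares, whose 'trf' method implements a trust-region-reflective algorithm comparable to lsqnonlin. This simplified version couples the three Paule-Mandel residuals only through a shared E_RV and omits the link terms of equation (14), so it is a sketch rather than the full model:

```python
import numpy as np
from scipy.optimize import least_squares

def pm_three_loops(loops, bounds=(0.0, 10.0)):
    """Solve three interdependent Paule-Mandel equations sharing one
    reference value E_RV (simplified sketch of equations (15)-(19)).
    loops : list of three (E, u) pairs, one per transfer standard.
    Unknowns: u_drift for each loop; E_RV is recomputed from the
    pooled weighted results at every step."""
    def residuals(ud):
        ws, Es = [], []
        for (E, u), t in zip(loops, ud):
            w = 1.0 / (np.asarray(u, float)**2 + t**2)
            ws.append(w)
            Es.append(np.asarray(E, float))
        W = np.concatenate(ws)
        X = np.concatenate(Es)
        E_rv = np.sum(W * X) / np.sum(W)      # shared reference value
        # one Paule-Mandel residual per loop
        return [np.sum(w * (E - E_rv)**2) - (len(E) - 1)
                for w, E in zip(ws, Es)]

    sol = least_squares(residuals, x0=[0.1, 0.1, 0.1],
                        bounds=([bounds[0]] * 3, [bounds[1]] * 3),
                        method='trf')          # trust-region-reflective
    return sol.x                               # u_drift per loop
```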

Combining results of pilot laboratories
The repeated measurement results of the pilots represent a special case. Their repetition was essential to monitor the drift more closely. On the other hand, they may account for a considerable portion of all results within a given loop, so that the E_RV could develop a bias towards them. Assuming that they are strongly autocorrelated, we reduce their weights in the calculation of the E_RV by the number of times they occur (see equation (20)), where i_Px represents the group of measurement instances i in a loop sequence performed by the two pilots, UL-FE/LMK as P1 and CETIAT as P2, respectively, and N_iPx stands for the number of these instances.
For the same reason, the degrees of equivalence were also combined into a single result according to equation (21).
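The pilot down-weighting of equation (20) amounts to dividing the weight of each repeated pilot result by the number of its occurrences; a minimal sketch (function and argument names are illustrative):

```python
import numpy as np

def deweight_pilots(w, pilot_mask):
    """Reduce the weights of a pilot's repeated, strongly
    autocorrelated results by the number of times they occur
    (sketch of equation (20)).
    w          : weights of all results in a loop sequence.
    pilot_mask : boolean mask marking one pilot's repeated results."""
    w = np.asarray(w, float).copy()
    n = np.count_nonzero(pilot_mask)       # number of repetitions
    w[np.asarray(pilot_mask)] /= n
    return w
```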

Comparison results
Using the equations in sections 6.3 and 6.4, we calculated 27 different results D(x,c,h) corresponding to the 27 different sensor combinations c, for each of the 21 participants x and 11 measurement points h.
In addition, a second set of results was created, in which outliers were automatically detected and rejected from the E_RV estimation. The rejection criterion was based on a 99th-percentile threshold.
This large pool of results provided broad scope for finding an optimal comparison set in which the between-set variances would have the least impact on the comparability of the results. Thus, for each measurement point h, we found the optimal combination c that gave the highest level of confidence in the E_RV estimate, i.e. the lowest uncertainty u(E_RV(x,c,h)) (see table 4). For each participant x there is a cluster of results belonging to the five or three measurement points h, respectively. The uncertainty bar of each result additionally marks the expanded (k = 2) uncertainty U(x,c_loop(x),h) originally reported by the participant for the corresponding transfer standard c_loop(x) from the optimal combination set c.

Discussion
Figures 3-5 give an indication of the comparability of the results. The uncertainty of the degree of equivalence, u(D), is higher at lower temperatures as well as at higher RH. Table 5 confirms this observation numerically.
In addition, table 4 shows that the uncertainty of the comparison reference value, u(E_RV), is significantly higher at −10 °C, with the highest value of 0,198%rh at 95%rh. Because the uncertainties of all the participating laboratories show the same trend of increasing towards high RH and low temperatures, u(D) and u(E_RV) behave in the same way. The trend originates from the sensitivity coefficients of RH with respect to dew point and air temperature [31], as all participants derive their RH references from these two scales. Other effects are assumed to be of lesser significance.
However, the degrees of equivalence do not refer only to the participants' references, but also to the transfer standards and the links between the loops used to compare the results. This can be seen in particular at low RH, where the sensitivity of RH to dew point and air temperature is small but the impedance hygrometers used as transfer standards show larger deviations. It is therefore crucial to separate the drift of the transfer standards from the references to achieve better comparability of the results. The Paule-Mandel toolset allows for this separation, which can lead to better identification of the main cause of a lack of comparability.
Links represent a close connection between the two error sources mentioned above and are therefore important to separate from the participants' results. Figure 6 shows the relative contribution of the individual variance components from equation (19) to the total variance of the degree of equivalence u²(D) for each measurement point h.
We can observe a large drift contribution compared to the other components, especially at lower RHs, while the links contribute relatively little over the whole range. Ideally, both variances would be negligible, allowing a better assessment of the participants' results. The difference between the uncertainty of the participants' results u and the respective uncertainty of the degree of equivalence u(D) is also associated with the measure of comparability. If the between-set variance remains significant, it is worth exploring alternative statistical tools to minimise the contribution of drift. The common weighted-mean approach assesses the uncertainty of drift u_drift using the variability of the pilots' intermediate measurements according to equation (22):

u_drift = (max_i E_P(i,s) − min_i E_P(i,s)) / (2√3).   (22)

Alternatively, Paule-Mandel evaluates u_drift during the process of finding E_RV until consistency is reached. Figure 7 compares u_drift calculated by the two approaches, with Paule-Mandel producing lower to significantly lower u_drift at air temperatures of 23 °C (h = 1-5) and 70 °C (h = 7, 8). At points 9 and 11 (−10 °C), where the picture is reversed, Paule-Mandel shows how much the between-set component needs to be increased beyond the independent weighted-mean evaluation of u_drift (see equation (22)) before consistency is reached. Since this excess cannot be attributed to drift alone, it can be assumed that underestimated uncertainties of the participants' results are the real reason for the lack of consistency at these points, all else remaining the same.

Conclusion
Comparing measurements that extend over weeks can be challenging, especially when many participants are involved and the value of the transfer standard is subject to material properties that drift. Addressing such cases is crucial for future comparisons of the state-of-the-art capabilities of NMIs, which are usually metrologically superior to the transfer standards used for the comparisons.
Several comparisons of similar duration have already been reported, the most relevant being the comparisons of dew point scales [1, 2]. However, in those comparisons the drift proved to be small and did not require any special treatment. Building on these cases, we addressed the remaining drift by using an alternative estimator, Paule-Mandel, which allowed us to combine both the links and the drift into a between-set variance and minimise their effects. We succeeded in achieving this for most of the measurement points. For the points where we could not obtain consistent results, especially at the RH extremes at −10 °C, we nevertheless provide a measure of how much the uncertainty of the drift estimates needs to be increased to achieve consistency. When this gap is relatively large, it indicates that the uncertainty of drift alone is not the major reason for the inconsistency, but rather the topology of the comparison (links) or the uncertainties of the participants.

Figure 2. A hypothetical estimation of drift from the pilots' measurements.

Figures 3-5 summarise the final comparison results obtained at these optimal combinations of transfer standards c. The graphs show the calculated D(x,c,h) with the corresponding expanded (k = 2) uncertainties U(D(x,c,h)) at air temperatures of 23 °C, 70 °C and −10 °C, respectively.

Figure 3. Degrees of equivalence D(x,c,h) with U(D(x,c,h)) at an air temperature of 23 °C; for each participant x there is a cluster of results at 10%rh, 30%rh, 50%rh, 70%rh and 95%rh, respectively (h = 1 ÷ 5); the cross marks on the U(D(x,c,h)) bars indicate the expanded uncertainties reported by the participants, U(x,c_loop(x),h).

Figure 4. Degrees of equivalence D(x,c,h) with U(D(x,c,h)) at an air temperature of 70 °C; for each participant x, there is a cluster of results at 10%rh, 50%rh and 95%rh, respectively (h = 6 ÷ 8).

Figure 5. Degrees of equivalence D(x,c,h) with U(D(x,c,h)) at an air temperature of −10 °C; for each participant x, there is a cluster of results at 10%rh, 50%rh and 95%rh, respectively (h = 9 ÷ 11).

Figure 6. Individual relative variances of the total variance of the degree of equivalence u²(D) for each measurement point h.

Figure 7. Average uncertainty of the drift evaluated with two different methods, Paule-Mandel vs. common weighted mean, for each measurement point h.

Table 1. The sequence of 30 measurement points at which the laboratories calibrated the transfer standards.

Table 3. Range of Birge ratios σ_B for different combinations of transfer standards.

Table 4. The minimum value of the standard uncertainty of E_RV achieved at different measurement points h.

Table 5. The average u(D) of all participants for each measurement point h.