Machine learning in legal metrology – detecting breathalyzers' failures

Breathalyzers used at sobriety checkpoints undergo strict quality control by metrological institutes or police departments to ensure the accuracy of their results. This paper presents a new approach to instrument evaluation using machine learning algorithms capable of detecting failures preemptively. Our objective was to predict instrument failures before they occur, that is, errors or standard deviations that exceed the allowable limits defined by technical regulations. To predict these failures, we employed historical instrument measurement data and applied classification techniques to label instruments as suitable or unsuitable, according to whether they were likely to fail during operation before the subsequent check. To increase the reliability of the failure prediction, we conducted fuel cell experiments to identify which instruments carry cells that could compromise the measurement results. To this end, we used the K-means clustering model, which identified two clusters based on the response signals recorded during the ethanol redox reaction. The study concluded with a wear simulation on low-performance electrochemical cells to verify whether adjusting the calibration curve of instruments fitted with these cells would preserve the instrument's accuracy until the next check.


Introduction
The regulation of measuring instruments by national metrology institutes is implemented per the guidelines of the International Organization of Legal Metrology [1]. Currently, these institutes are making great efforts to apply digital processes to the routine control of instruments such as road scales, fuel pumps and others [2]. Although the use of machine learning in model evaluation and verification processes is still in its early stages, it has great potential due to the availability of information in databases that connect tests carried out by different actors, such as metrological agents, manufacturers, and mechanical workshops. Thanks to today's computing power, the thousands of records generated can be fed to machine learning algorithms to predict trends, understand behaviors, and identify misconduct [3][4][5]. These are useful tools for predicting instrument failures before they occur.

Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
In the field of legal metrology, predictive algorithms have been reported for the statistical control of measuring instruments. The methodology involves three steps: statistical control of conformity, establishment of control limits, and subsequent metrological verification [6]. The algorithm determines whether an instrument is prone to failure and recommends appropriate actions. This model helps keep instruments within the limits established in the regulations until the next verification.
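The control-limit step of this cited methodology can be sketched as a simple mean ± 3σ rule; the error values, the 3σ choice, and the function name below are illustrative assumptions, not details taken from [6].

```python
import statistics

def control_limits(history, k=3.0):
    """Control limits built from an instrument's historical errors:
    mean +/- k standard deviations (k = 3 is the classic Shewhart choice)."""
    mean = statistics.fmean(history)
    sigma = statistics.stdev(history)
    return mean - k * sigma, mean + k * sigma

# Hypothetical past indication errors for one instrument, in mg/l.
history = [0.004, 0.006, 0.005, 0.007, 0.005, 0.006]
low, high = control_limits(history)

# A new verification error far outside the limits flags the instrument
# as prone to failure, prompting action before the next verification.
new_error = 0.019
print(low <= new_error <= high)  # False
```

An out-of-control point does not by itself mean the instrument exceeds the regulatory limit; it signals a drift worth acting on early, which is the preventive spirit of the methodology described above.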
Metrology institutes must have autonomy and independence in processing the information generated. Digitizing the data obtained from model evaluation and verification is essential for effective management [7][8][9]. This area of legal metrology faces numerous challenges in fulfilling its role, as demonstrated by the metrological control of road scales.
These instruments require standard weights corresponding to the usual bearing load, which makes the test time-consuming and expensive given the dimensions of the instruments. To improve the accuracy of road scale control, a study was conducted using machine learning algorithms for neural network classification [10]. Based on verification data collected over two years, the authors created a classification algorithm that can determine whether a given instrument would pass the indication error test; the study achieved a reliability of 83.3%. Thus, knowing the instrument's tendency toward a given result, the operator can take steps to reduce weighing errors. Metrological agents can use this application during subsequent verification tests, guiding them to send the instrument for preventive maintenance.
Another aspect to be considered in the digitalization of legal metrology is the application of intelligent sensors directly connected to measuring instruments, such as those currently used in Industry 4.0. The real-time transmission of instrument-behavior data speeds up inspection actions and reduces damage caused by measurement errors, preventing, for example, intentional fraud.
One potential application of machine learning in legal metrology relates to testing methods and the evaluation of results; it can be used to predict failures in breathalyzers used in traffic inspections. Identifying whether an instrument will develop defects before the next subsequent verification provides insights for actions that minimize false negative or false positive results. This allows greater accuracy by traffic enforcement agents and, consequently, increases the credibility of traffic laws dealing with driving under the influence of alcohol.
In several countries, governments have established strict laws to ensure road safety and prohibit drivers from operating vehicles with high blood alcohol content. Tests can be carried out using portable breathalyzers during traffic inspections. The authorities' main challenge is to establish safe criteria for imposing penalties on drivers. Verification indicates the measurement conditions only at the time of testing; it is a snapshot taken at the moment of verification. Once an instrument is approved, there is no guarantee that it will keep its errors within limits until the next check.
Despite the reliability of these measurements, there is a risk of measurement errors because the instrument may have undetected faults. Regularly checking breathalyzers is crucial; however, determining the exact onset of instrument failure is a challenge. Incorrect breathalyzer results can lead to unfair punishments or to a failure to detect the alcohol concentration of drunk drivers. One approach to avoiding unfair penalties is to void all fines generated by an instrument found to be non-compliant at the subsequent verification.
However, this type of action does not penalize drivers who genuinely drive under the influence of alcohol. Predicting instrument failure in advance is essential for implementing preventive measures.
This work proposes a predictive model for breathalyzer failures using measurement data from subsequent verifications. The analyses were divided into two main stages. The first involved creating classification models that distinguish instruments that will remain within error limits until the next check from those that will tend to fail before it. The second involved analyzing the data with a clustering model, again to predict breathalyzer failure, but now based on the performance of the electrochemical cell built into each instrument.
Finally, we simulated fatigue on the instrument by applying successive measurements and observing the deviation of the readings. This was done to ensure that a possible adjustment would not compromise the instrument's performance by letting it exceed the maximum permissible errors.

Methods and results
To conduct the study, we extracted test results from subsequent verifications over a five-year period. Sampling and preprocessing were important steps in building the machine-learning models.
Our objective was to predict instrument failures before they occurred. To train the models, we used error and standard deviation results at three concentrations. We created a binary dependent variable (classes) for the collected observations (figure 1). This variable was labeled 0 or 1: 0 represented an instrument that passed two subsequent verifications, and 1 represented an instrument that passed one verification and failed the next, as shown in figure 1.
We excluded samples with intervals greater than 380 d from the data set because they did not present a fixed sequence between verifications. For training, we considered only breathalyzers verified with dry-gas reference material.
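The filtering and labeling rule above can be sketched as follows; the record fields (`instrument_id`, `ver_date`, `passed`, `errors`) and the example values are hypothetical stand-ins for the study's actual schema, not the real data.

```python
from datetime import date

def label_pairs(records, max_interval_days=380):
    """Build (features, label) pairs from consecutive verifications.

    Label 0: instrument passed this verification AND the next one.
    Label 1: instrument passed this verification but failed the next.
    Pairs separated by more than `max_interval_days` are discarded,
    as they lack a fixed sequence between verifications.
    """
    by_id = {}
    for r in sorted(records, key=lambda r: (r["instrument_id"], r["ver_date"])):
        by_id.setdefault(r["instrument_id"], []).append(r)

    labeled = []
    for history in by_id.values():
        for prev, nxt in zip(history, history[1:]):
            gap = (nxt["ver_date"] - prev["ver_date"]).days
            if gap > max_interval_days or not prev["passed"]:
                continue  # interval too long, or first check already failed
            labeled.append((prev["errors"], 0 if nxt["passed"] else 1))
    return labeled

# Illustrative records: one instrument that passed, then failed 361 days later.
records = [
    {"instrument_id": "A", "ver_date": date(2020, 1, 10), "passed": True,
     "errors": [0.004, 0.006, 0.008]},
    {"instrument_id": "A", "ver_date": date(2021, 1, 5), "passed": False,
     "errors": [0.031, 0.035, 0.040]},
]
print(label_pairs(records))  # one pair, labeled 1
```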

Classification models
A common problem when using supervised learning for classification is class imbalance, and this study was no different: the samples selected for training had a ratio of 1:6.3 (classes 1 and 0, respectively) (figure 2).
Training the model with unbalanced data would bias learning toward the features of the majority class, while the minority class would often be misclassified [11]. Among the various techniques for eliminating class imbalance, we chose to combine the neighborhood cleaning rule with random undersampling, which removed outliers and excluded samples from the majority class, respectively.
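A minimal sketch of the random-undersampling step is shown below; in the study it was preceded by a neighborhood cleaning rule (e.g. imblearn's `NeighbourhoodCleaningRule`) to remove noisy majority samples first. The 63:10 imbalance here is only illustrative of the ~1:6.3 ratio reported above.

```python
import random

def random_undersample(X, y, majority=0, minority=1, seed=42):
    """Keep all minority samples and an equal-sized random subset
    of the majority class, restoring a 1:1 ratio."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    majority_idx = [i for i, label in enumerate(y) if label == majority]
    kept_majority = rng.sample(majority_idx, k=len(minority_idx))
    keep = sorted(minority_idx + kept_majority)
    return [X[i] for i in keep], [y[i] for i in keep]

# Illustrative imbalance: 63 passing instruments (0) vs 10 failing ones (1).
X = [[i] for i in range(73)]
y = [0] * 63 + [1] * 10
Xb, yb = random_undersample(X, y)
print(yb.count(0), yb.count(1))  # 10 10
```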
Once the pre-processing stage was concluded, we applied the models and analyzed their metrics. We used the PyCaret library [12, 13] (an open-source AutoML tool in Python), as it is extremely simple and allows the workflow to be replicated in future approaches. Although PyCaret allows custom configurations through its 'setup' command, we used the default settings, considering that the library itself would make the necessary adjustments to find the best model, including hyperparameter optimization. PyCaret also performed data pre-processing, splitting the data into 70% for training and 30% for testing.
The Extra Tree Classifier model had the best performance, with precision, accuracy, and recall results of 77%, 75%, and 74%, respectively (table 1).
Our approach aimed not only to observe results but also to propose actions based on the statistics the model returns for the labels. To achieve this, we created a confusion matrix for the Extra Tree Classifier model (figure 3) and visualized the distribution of true-positive, false-positive, true-negative, and false-negative results.
By understanding this distribution, we could identify which results had the greatest impact. In our application, we believe it would be better to have fewer false negatives than false positives. Here, false negatives are instruments labeled as likely to fail before the next verification but which would not actually fail; sending them for maintenance would generate unnecessary costs.
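The weighing of the two error types described above can be sketched as a cost comparison over the confusion-matrix cells. To sidestep the positive-class convention, the sketch names the error types directly; the cost values and example labels are illustrative assumptions, not figures from the study.

```python
def confusion_cells(y_true, y_pred, fail=1):
    """Count confusion-matrix cells, with class `fail` (1 = will fail)
    as in the study's labeling."""
    tp = sum(t == fail and p == fail for t, p in zip(y_true, y_pred))
    tn = sum(t != fail and p != fail for t, p in zip(y_true, y_pred))
    wrongly_flagged = sum(t != fail and p == fail for t, p in zip(y_true, y_pred))
    wrongly_cleared = sum(t == fail and p != fail for t, p in zip(y_true, y_pred))
    return tp, tn, wrongly_flagged, wrongly_cleared

def expected_cost(y_true, y_pred, cost_flagged=1.0, cost_cleared=5.0):
    """Illustrative costs: a wrongly flagged instrument incurs needless
    maintenance; a wrongly cleared one may produce invalid roadside
    results until the next verification (assumed more expensive)."""
    _, _, wf, wc = confusion_cells(y_true, y_pred)
    return wf * cost_flagged + wc * cost_cleared

y_true = [0, 0, 0, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 0]
print(confusion_cells(y_true, y_pred))  # (1, 3, 1, 1)
print(expected_cost(y_true, y_pred))    # 6.0
```

With such a cost function, the decision threshold of the classifier could be tuned toward whichever error type the regulator deems more harmful.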
Experts must analyze the metrics to make decisions and direct efforts toward mitigating the errors that most negatively impact society. These decisions will differ depending on the type of instrument and the field of application.

Clustering model
Knowing how the breathalyzer is constructed and the main components involved in the measurement allowed us to infer that a failure in any of its components could make the measurement unfeasible or, in the worst case, lead to incorrect readings. When the lung air sample contains ethanol molecules, an oxidation-reduction reaction occurs and electrons are released.
The microprocessor monitors the entire measurement cycle and delivers the ethanol concentration result in mg l−1. When a problem related to temperature or flow occurs during measurement, the temperature, flow, and pressure sensors detect it and send signals to the microprocessor. An alarm is then issued, making the measurement impossible.
Data generated by breathalyzer repair and maintenance workshops in Brazil showed that, of the six main components, the electrochemical cell (fuel cell) had the highest incidence of repairs (40.20%). The fuel cell can wear out in dry environments, reducing sensitivity and, consequently, causing measurement errors or poisoning of the platinum electrode [14, 15].
Next comes the flow sensor (26.51% of repairs), which interrupts the reading if the exhaled air flow is interrupted. The component with the third-highest incidence of repair was the temperature sensor (16.96%), which ensures that the instrument operates within the range specified by the manufacturer (figure 4).
Given that the fuel cell is a critical component in the breathalyzer measurement process, we mapped its operation by recording electrical current readings as a function of time.By studying the deformation of the curve, it was possible to identify the limit between good and degraded sensors.This would enable us to determine whether an adjustment or replacement of this fuel cell was necessary at the time of verification.
In the first round of tests, we had only four discrete current-time readings, which did not allow us to build a reliable curve. We then applied a clustering model to separate two groups of behaviors related to electrical current and time. We chose the K-means clustering model with k = 2 (number of clusters). This unsupervised learning method groups the data according to similarity.
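A minimal sketch of this K-means step with k = 2 is given below, using synthetic (peak current, response time) pairs in place of the recorded cell signals; the magnitudes, units, and group sizes are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-ins for cell responses: new cells tend to show higher
# peaks and shorter response times than worn cells (values illustrative).
rng = np.random.default_rng(0)
good = np.column_stack([rng.normal(650, 30, 20),    # higher current peaks
                        rng.normal(2.0, 0.2, 20)])  # shorter response times
worn = np.column_stack([rng.normal(350, 30, 20),
                        rng.normal(4.5, 0.3, 20)])
X = np.vstack([good, worn])

# In practice the features would be standardized first so that one axis
# does not dominate the Euclidean distance.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# With well-separated groups, each group falls entirely in one cluster.
print(len(set(labels[:20])), len(set(labels[20:])))  # 1 1
```

K-means assigns no meaning to its cluster indices; relating cluster 0/1 back to "good" versus "worn" cells requires inspecting the curves, as done in the study.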
The visualization of the classes was obtained with the sns.pairplot function from the Seaborn library, which generates a pair plot of the input variables (figure 5).
With the identified classes added as hue, we could see a separation between the two categories. According to the model, this distinction was made by grouping the closest points (similarity). We already knew that curves from better-performing (new) cells present very different values from those of worn cells; for example, they tend to exhibit higher peaks and shorter response times. What we wanted was to find the threshold of this transition and, thus, indicate the exact moment to withdraw the instrument from use.
Our analysis covered five input variables related to the model's categorization. Peak was separated at the limit of 500 A, with values above this threshold classified as 0 (blue) and lower values as 1 (orange). T-Peak IN, T-Peak FN and Time are the peak start time, peak end time, and total time of the measurement curve, respectively; for these three inputs, the lowest values were classified as 0 (blue) and the highest as 1 (orange). Finally, the measurement input variable presented overlapping values, making it impossible to identify a clear separation from the sensor readings; in general, however, this did not affect the final analysis of the model.
From these results (figure 5), we see that it is possible to separate different groups based on the behavior of the points of the measurement curve (diagonal of the graph).
The clusters overlapped when we analyzed the alcohol-content values, which can lead to erroneous interpretations. Thus, this evaluation was not related to the value read by the instrument against a reference, but to the shape of the curve generated during the measurement.
It is possible to separate the instruments based on the response of the electrical current generated during the measurement.However, a worn-out cell does not necessarily require replacement.

Temporal study of fuel cell wear
The deterioration of electrical and mechanical components in measuring instruments and systems has been studied extensively, with the aim of maximizing service life and minimizing failures. We also know that adjustment is a widely used resource, usually performed during calibration.
This practice aims to reduce maintenance costs but, on the other hand, raises the concern that the instrument may not be stable, in other words, that it may not remain within the maximum permissible errors until the next verification (12 months later).
We therefore proposed a temporal study, now using only the worn cells, to confirm whether they remained stable enough to respond satisfactorily until the next verification.
The study does not aim to forecast future results, as in a usual time-series application, but rather to understand the trend and possible future values in order to assess the feasibility of using worn, adjusted cells.
The cells of interest (worn cells) were selected after a round of readings using different types of fuel cells. The plotted data are shown in figure 6, where the concentration decay during the measurements is noticeable.
Figure 6 shows a downward trend of approximately 10% in the ethanol concentration indicated by the instrument. Regarding seasonality, although we observed plateaus (measurement stability) followed by decreases in the indication, there is no way to state that this behavior would be faithfully mirrored in real usage; for that, field collections over a minimum period of 12 months would be necessary. However, this is a strong indication of instability in the measurements, which could compromise the accuracy of the readings.
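The downward trend can be quantified with a simple linear fit. The series below is synthetic, built only to mimic the roughly 10% drop over ~200 readings seen in figure 6, since the real measurement data are not reproduced here; the starting indication of 0.300 mg l−1 is an assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 206                              # consecutive measurements, as in figure 6
start = 0.300                        # illustrative initial indication, mg/l
true_decay = 0.10                    # ~10% drop over the whole run
y = start * (1 - true_decay * np.arange(n) / (n - 1))
y += rng.normal(0, 0.001, n)         # small measurement noise

# Least-squares line through the series; the fitted slope and intercept
# give the relative drop from first to last measurement.
slope, intercept = np.polyfit(np.arange(n), y, 1)
relative_drop = -slope * (n - 1) / intercept
print(round(relative_drop, 2))       # close to 0.10
```

On real data, the residuals around such a fit would also expose the plateau-then-drop pattern mentioned above, which a straight line cannot capture.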
Based on the current rules of instrument use, we know that instruments, after repairs, must show errors below 0.020 mg l−1, whereas instruments in use are allowed errors of up to 0.032 mg l−1. In other words, the tolerance margin for decay can reach 60%, much higher than what was observed during the experiment.
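The 60% margin follows directly from the two regulatory limits:

```python
# Tolerance margin between the post-repair limit and the in-use limit.
post_repair_limit = 0.020  # mg/l, maximum error allowed after repair
in_use_limit = 0.032       # mg/l, maximum error allowed in use
margin = (in_use_limit - post_repair_limit) / post_repair_limit
print(f"{margin:.0%}")  # 60%
```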
In our case, although we observed the instability, the tolerance allowed in the standard guarantees a safety margin to meet the error limits determined in the technical regulations.

Discussion and conclusions
Metrology is committed to ensuring accurate measurements. It is therefore necessary to develop alternatives that improve the processes inherent to measuring instruments and methods, for example, in the control of breathalyzers used in traffic inspections.
Prediction models based on machine learning techniques are increasingly used to achieve results as quickly and accurately as possible. They are currently used in different business segments to find potential customers, predict the weather, diagnose diseases, offer products, prevent failures, and increase profitability. In metrology, result prediction has been applied to calibration, with great potential for testing and metrological supervision. This study addressed machine learning techniques for classifying and grouping breathalyzers to identify failures before they occur.
For the classification of breathalyzers, the results show the complexity of understanding the behavior of instruments in use before they present an error above the permitted value. The metrics show that the Extra Tree Classifier is the best model, with an accuracy of 75% and a precision of 77%. Although satisfactory, it leaves the task of managing false-negative and false-positive results to the authorities.
To deepen the study of failure prediction in breathalyzers, we carried out usage simulations to understand the behavior of the instrument. In this step, we simulated the behavior of the measurement curves by grouping new and worn electrochemical cells. It was possible to establish two categories whose curves, on a first analysis, were related to the response of the cells' chemical reaction in the presence of ethanol. Finally, we performed measurements on the worn cells to establish whether they could be adjusted and continue in use. For this analysis, we used a time-series approach, which showed a decay rate that did not reach the maximum permissible errors. This provides greater reliability to instruments leaving maintenance workshops.
Once the accuracy of the model is determined and combined with the classification analysis (Extra Tree Classifier model), we would have a valid criterion for removing an instrument from use, thereby reducing the error rates of instruments in service (supervision).
In this context, we can see the importance of not only having the metrics of the classification model but also evaluating the different impacted scenarios and carrying out risk management.
Therefore, before applying the models, it was necessary to analyze the data and understand their distribution, the instruments' constructive components, and the regularity with which defects appear. This is one more benefit of combining computational tools, statistical analysis, and metrological control when working with large amounts of data.
Finally, we hypothesized that the worn cells could remain in use through calibration and adjustment, thus extending their useful life for at least one more period (12 months). As the cell degrades, the measurement curve shows greater instability, with a reduction of approximately 10% in the indication over a cycle of approximately 200 measurements.
During the study, we considered using time-series models to predict the exact moment when the cells would exceed the allowed errors; however, a longer study would be necessary to accumulate a historical record sufficient to train such models.
The main objective of this study was achieved, and new issues were identified that will lead to future work. The first concerns the need to improve the database architecture to facilitate extraction and processing. The second relates to enabling access levels for external users; in addition to providing transparency to metrological control, this would allow other researchers and analysts to conduct studies and suggest improvements. In parallel with this availability, information security must be considered to guarantee the privacy of the instrument holders.
Finally, we suggest the use of machine learning in the construction of digital twins, which would virtually represent the instruments and measurement processes for each embedded electrochemical cell model. With this, we would systematize metrological control using concrete applications of artificial intelligence. This new form of action would only be possible with the training of metrologists and investments in computational infrastructure and security. The use of machine learning would result in lower costs, higher productivity and quality, and, consequently, greater confidence in the measurement process.

Figure 1. Criteria for selection and labeling of samples to compose the data set for training and testing.

Figure 2. Class count: 0 for instruments that did not fail subsequent verification and 1 for those that failed.

Figure 3. Confusion matrix for the Extra Tree Classifier model.

Figure 4. Distribution of frequency of repairs performed on breathalyzers by authorized workshops in Brazil (data from 2017 to 2020).

Figure 5. Pair plot of the input variables, grouped according to the two categories found by the K-means algorithm.

Figure 6. Temporal distribution of the values indicated by the fuel cell over 206 consecutive measurements.

Table 1. Comparison of classification model results.