Accuracy assessment of applied supervised machine learning models on usual data probability distributions

In this paper, an application analysis of supervised classification techniques on several probability distributions is carried out. Accuracy, along with other standard metrics, is used to rate the performance of the generated learning models. Using data that fit different distributions, we investigated whether the choice of classification method had an optimizing impact on the accuracy of its associated learning model.


Introduction
The Internet of Things has emerged as one of the most actively explored areas of research in modern times. This is due to the need for biosensors capable of collecting data and exchanging them using smart devices, mostly based on cloud solutions. Several areas are inter-related with these smart solutions, such as agriculture automation, security and surveillance, smart homes and cities, machine-to-machine sensor networks, telemedicine, and healthcare.
In several data-driven fields, optimal decision making is the key issue. Probability theory provides us with the ability to improve decision-making systems by analyzing the in-depth behavior of the data [1]. Systems used in the medical field often aim to optimize the actions performed by practitioners; in doing so, the relevance of machine learning to this industry is second to none [2][3]. However, it is also important to identify some hidden aspects of the data. In the field of telemedicine and healthcare, several applications have been designed to assist the elderly and people suffering from chronic conditions by ensuring real-time medical monitoring of their vital indicators. Early detection and diagnosis of diseases can slow the progression of illness and can lead to a significant reduction in the cost of health services [4]. Hence, ubiquitous systems have the capacity to streamline and enhance access to health services, especially after an ICU stay or when patients are allocated to home-based care [5].
Wireless body area networks are the most requested architectures in the development of these applications. They assist the medical care team in capturing data using several decision-making indicators closely related to patient's health conditions. Heart rate, body temperature, glucose, blood pressure, SO2 and so-on are prime examples of widely used life-supporting and life-sustaining measures [6]. These indicators are part of a WBAN-specific typology that periodically exchange the sensed information with a sink. Therefore, medical information can be collected, exchanged, and stored in an optimal architecture for this purpose. Upon classification of these records, and given the sensitivity of this type of data, it is necessary to develop an analytical study of the distributional and probabilistic aspects.
This work therefore required studying three of the most prevalent probability distributions in the IoT environment, namely the Normal, Poisson, and Uniform distributions. Several studies have highlighted the availability of data fitting these distributions, e.g. in traffic management, security, and medical information. Furthermore, every machine learning technique used for data classification includes, in an explicit or implicit manner, a few assumptions about the data. The cornerstone of any data analysis or decision making is the application of the appropriate machine learning technique with the aim of obtaining optimal results. Data distribution factors found to influence decision making have been explored in several studies closely related to IoT [7] [8]. Several studies propose information-sensitive approaches to optimize data aggregation, data collection, or the efficiency of their benchmark parameters. Authors in [9] propose an approach to collect optimized information with a predefined sampling probability to ensure performance in terms of data coverage and information amount. A probabilistic-based classification procedure is applied to normal distributions to assess the gravity of detected medical anomalies in [8]. A multivariate normal distribution was considered to reduce the complexity of performance analysis of WBAN characteristics during walking in [10]. Along these lines, solutions for diabetic patients have been provided: in [11], a KNN-based classification management system is proposed, while in [12], the authors designed a health care system that facilitates rapid diagnosis.
The overall structure of this paper takes the form of four sections, including this introductory section. The second section deals with the experimental design, the data distributions, and the machine learning methodologies involved. The third section highlights the achieved results in terms of accuracy and effectiveness using several standard measures. Finally, a concluding section summarizes this study and outlines a broader research perspective.

Methodology
IoT solutions generate a very high volume of traffic during the collection and distribution of information. This traffic can be of a periodic or event-driven nature depending on the scope of the system involved [7], [13]. In a context of in-depth data analysis, it is worth having a global view of the data generated by a device belonging to an IoT network. The first step consists of generating sensed data using a connected device and sharing the gathered information through the Internet. Then comes the process of setting up a logical structure to apply different analysis-based approaches. A final step is the exploitation and use of these data in subsequent stages. This is relevant in giving the data the ability to create more valuable information for users and researchers.

Experimental design
In this subsection, an end-to-end information management architecture is proposed to describe the process involved in using a smart device. Figure 1 illustrates the experimental design. It should be noted that this paper deals solely with the data aspect of the architecture, addressing both the data distribution and data classification steps. The proposed architecture shown in Figure 1 is composed as follows:
• Roles: The architecture is made of actors and systems, each playing an important modelling function within the different processes:
o Actors
▪ Patient (actor agent)
▪ Doctor (actor agent)
o Systems
▪ Data collection (system agent)
▪ Analysis (system agent)
▪ Check distribution (system agent)
▪ Data classification (system agent)
• Data collection: Using a smart device, a patient can sense their real-time vital information. Nowadays, many implanted sensors are used to gather biological data such as heart rate, blood pressure, temperature, gesture, and so on. Gathered data are transmitted through wireless technologies (Wi-Fi and Bluetooth) and stored in databases adapted to medical use for a given process or task.
• Analysis: Data analysis is the process of retrieving primary data and turning it into useful information for decision-making ends, while testing hypotheses and assumptions. To do so, feature selection is applied. The core principle of a feature selection technique is to retain the subset of relevant dataset features. Then comes the contribution of a practitioner, whose role is to ensure the inclusion of management rules related to his or her field of expertise. In this case, the doctor can support the architecture by addressing the patient's auscultatory tests, as well as the normal ranges of vital signs.
• Check distribution: A statistical test is carried out to determine whether experimental entries fit a usual data distribution. Goodness of fit was assessed by calculating different statistical measures, namely the p-value and s-value, along with the Anderson-Darling and Kolmogorov-Smirnov tests [8]. Data used in this study fit several usual probability distributions, namely:
• Normal distribution
• Poisson distribution
• Uniform distribution
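As an illustrative sketch (not the authors' actual pipeline), the goodness-of-fit step could be performed in Python with SciPy; the simulated readings and distribution parameters below are assumptions for demonstration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical sensor readings drawn from a Normal distribution
readings = rng.normal(loc=105, scale=10, size=500)

# Kolmogorov-Smirnov test against a normal distribution fitted to the sample
mu, sigma = readings.mean(), readings.std(ddof=1)
ks_stat, p_value = stats.kstest(readings, "norm", args=(mu, sigma))

# Anderson-Darling test for normality
ad_result = stats.anderson(readings, dist="norm")

print(f"KS statistic: {ks_stat:.3f}, p-value: {p_value:.3f}")
print(f"AD statistic: {ad_result.statistic:.3f}")
```

A high p-value (e.g. above 0.05) indicates no evidence against the hypothesized distribution; analogous checks could be run with `stats.poisson` or `stats.uniform` for the other two distributions.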

Data pre-processing
The purpose of this sub-section is to explain the data format. The database is a collection of sensed information organized so that it can be easily accessed as a decision-oriented data source. Each dataset consists of at least 500 decision-making readings, where a single tuple row (entry) {R} groups three items into a single compound value. The first two numeric variables are the sensed readings x and y, fitting the same probability distribution. The third one is a ruleset-based binary decision C. A data row is thus defined by R = {x, y, C}. For every dataset, each of the x and y variables is represented by a one-dimensional vector; the two are stacked together to make a two-dimensional vector {x, y}.
After the data generation step comes the implementation of several supervised machine learning techniques, which typically involve predicting a known outcome or goal [14], stated here in the decision field C. For all datasets featured in this paper, we set the decision rules as follows:
• if ((90 < x < 120) OR (60 < y < 80)) then C = 0; 0 stands for a Normal measure.
• else C = 1; 1 stands for an Abnormal measure or Anomaly.
This results in a tuple {x-value, y-value, 0} for normal readings or {x-value, y-value, 1} for abnormal readings. The classification process thus leads to binary-labelled entries. Table 1 summarizes the number of normal and abnormal measures of the generated datasets according to each data distribution approach.
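The decision rule above can be sketched in Python; the paper does not publish its generation code, so the distribution parameters below are hypothetical choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500  # at least 500 decision-making readings per dataset

# Two sensed variables x and y, drawn from the same distribution (here: Normal)
x = rng.normal(loc=105, scale=15, size=n)
y = rng.normal(loc=70, scale=12, size=n)

# Ruleset-based binary decision C: 0 = Normal, 1 = Abnormal
C = np.where(((x > 90) & (x < 120)) | ((y > 60) & (y < 80)), 0, 1)

# Each row is a tuple R = {x, y, C}
dataset = np.column_stack([x, y, C])
print(dataset.shape)  # (500, 3)
```

Swapping the two `rng.normal` calls for `rng.poisson` or `rng.uniform` draws would yield the Poisson- and Uniform-based datasets in the same format.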

Assessment of data classification
To select the optimal model among the candidates, its performance must be estimated by means of standard metrics, including precision, recall, and the F1-score, the harmonic mean of recall and precision. The confusion matrix, also known as the error matrix, is a layout table used to describe the performance of the generated models. Each row represents the instances in a predicted binary class, while each column illustrates the instances in an actual binary class. Table 2 depicts the standard metrics used.
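These standard metrics can be computed with scikit-learn, as in this minimal sketch with made-up predicted and actual labels (note that scikit-learn's `confusion_matrix` places actual labels on the rows, the transpose of the layout described above):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Illustrative actual vs. predicted binary labels (0 = Normal, 1 = Abnormal)
y_true = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 0, 1, 1, 0]

# Rows = actual class, columns = predicted class in scikit-learn's convention
cm = confusion_matrix(y_true, y_pred)

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)  # harmonic mean of precision and recall

print(cm)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")  # P=0.80 R=0.80 F1=0.80
```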

Applied supervised machine learning techniques
The supervised learning techniques employed in this investigation are listed below. Their selection relied on three fundamental points:
• The ability to use either classification solvers or regression solvers, or both.
• The scale-sensitive and non-scale-sensitive aspects of these algorithms.
• Model bias and variance analysis based on single and ensemble classifiers.
Table 3 summarizes the selection criteria for the techniques involved in this study. Respecting such criteria makes it possible to diversify the options for learning techniques while remaining unbiased as to whether one algorithm has an advantage over another.

Cross validation
Cross validation is a practical tool for estimating the robustness of machine learning models. It is frequently employed to benchmark and choose a model for a predictive modelling assignment. This approach includes a parameter k that refers to the number of sample groups into which the generated datasets are to be partitioned. In this work, the value of k has been set to five, chosen to ensure that each train/test group of data samples is large enough to be significantly representative of the dataset. Given that the datasets used in this study are composed of 500 measurements, the value k = 5 yields 5 data samples of 100 measurements each, which we consider satisfactory. The aim is to ensure a less biased and less optimistic valuation of model efficiency, especially during the training phase, which is critical for the final learning outcomes. The procedure consists of first randomly shuffling the datasets, then splitting all the observations into k groups of equal size, training, testing, and evaluating the generated groups individually, and finally summarizing the models' performance using the standard evaluation scores.

Results
In this section, the results obtained are grouped in the form of different plots, namely the scatter plot of each dataset, the scatter plot of normal and abnormal measures, the data distribution representation of each variable, the training and cross-validation score plots, and finally a time simulation complexity table (Table 4) designed to evaluate the time duration of each dataset throughout the classification process. Figures 2 and 3 show the scatter plots of the first dataset, generated based on the Normal distribution, and of its variables binary-classified as normal and abnormal, respectively. The classification scores of the different techniques are presented in Table 5; the standards used include precision, recall, and F1-score. Table 6 summarizes the accuracy scores of the supervised machine learning techniques used in the study conducted in this paper. Table 7 illustrates the confusion matrices of the best two classifiers, namely XGBoost and SVM. Note that Class-0 denotes the normal measures, while Class-1 denotes the abnormal ones.

Discussion
A closer look at the results indicates that the current research appears to validate the view that SVM and XGBoost return very good scores for all distribution-based datasets. The XGBoost technique, known to be one of the best performing classifiers, scored almost perfectly. Dealing with the SVM technique often requires the optimization of the Gamma and C parameters. From a time-complexity standpoint, execution times within [0.01, 0.02] seconds provide confirmatory evidence that SVM ensures very accurate outcomes, especially when dealing with a decision-support system that requires short response times. XGBoost's seamless overall performance comes at the cost of its more time-consuming use of decision trees, with execution times within [0.06, 0.1] seconds.
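The Gamma and C optimization mentioned above is typically done by an exhaustive grid search over cross-validated candidates; a hypothetical sketch (the grid values and synthetic data are assumptions, not the paper's actual settings) could be:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic two-feature binary dataset standing in for the paper's data
X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

# Candidate values for the RBF kernel's C and gamma hyperparameters
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}

# 5-fold cross-validated grid search, matching the k = 5 used in this study
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```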
During this learning process, a minimum threshold of 80% was established, and only models with accuracy scores above this threshold were subsequently accepted. The day-to-day use of IoT devices often requires a combination of new approaches to data classification and analysis to meet given decision-making requirements. Moreover, many optimization approaches depend entirely on the data collection process. Deciding on the right classification technique requires a comparative analysis of the different available approaches. And yet, this was very concisely established within this research paper.
As illustrated in Table 3, our study considered several parameters that must be weighed when choosing a classifier for intelligent systems. Basically, classification is about predicting labels, while regression is a matter of predicting a quantity. The selection of a learning technique must satisfy these two analytical grounds while ensuring accurate and time-efficient results. In the era of big data, intelligent systems operate with sensitive data processing on a large scale, which calls for analytical approaches capable of ensuring these tasks. Thus, both SVM/SVR and XGBoost validate the required assessments when dealing with data of small, medium, and large sizes. The consensus view seems to be that these two techniques offer:
✓ The ability to use either classification solvers or regression solvers, or both.
✓ Handling of both scale-sensitive and non-scale-sensitive data aspects.
IoT systems are applied across multiple areas handling different data distributions. We list several possible use cases in accordance with the distributions mentioned in this paper.

Conclusion
In this paper, several supervised learning techniques were applied to classify decision-making related data. Because of the sensitivity of studying the internal nature of data-driven behaviour, three probability distributions were used to evaluate possible relationships between a data distribution and the most appropriate learning technique, with the aim of selecting a classifier that could be useful for decision-making processes. The results yielded by this study provide convincing evidence that two of the classifiers, namely SVM and XGBoost, returned strong results for all datasets and their associated distributions. The evaluation of the selected learning models was based on the classical standards of measurement, while also highlighting the execution time required throughout the end-to-end application. Part of the research perspective associated with this work would be to propose new approaches for optimizing the objective functions and internal parameters of these classifiers to better match them to a given data distribution.