Comparison and analysis of accuracy of various machine learning algorithms in abnormal state monitoring of Internet of Things devices

With the development of Internet of Things (IoT) technology, more and more devices are connected to the Internet, including not only traditional computers, mobile phones and other smart terminals, but also a wide range of sensor devices. These sensors collect environmental information and physical quantities such as temperature, humidity, air pressure, light intensity and vibration. The resulting data are real-time, large in scale and diverse, and must be processed and analyzed by appropriate algorithms. Building on previous studies, this project summarizes the application of machine learning algorithms to device state detection, compares their behavior on sensor device data, and evaluates each regression model using MSE, RMSE, MAE, MAPE and R². By analyzing a large volume of historical data, models of different equipment states can be established and used to monitor and predict the current equipment state, which helps avoid production line downtime and other losses caused by equipment failures or abnormalities. In-depth analysis of historical data can also reveal potential problems so that preventive measures can be taken. The overall aim is to process and analyze sensor data efficiently and to provide more accurate, reliable and efficient equipment condition monitoring and forecasting services for enterprises and individuals.


Introduction
Masataka argues that, because IoT deployments typically involve a large number and variety of devices, it is important to collect data efficiently and to detect threats in a lightweight way [1], [6]. Arshiya notes that, driven by deep learning, the research community has addressed key challenges in cybersecurity such as malware identification and anomaly detection [2]. Genaro sees IoT as an ally in easing everyday activities in settings ranging from smart homes to industrial environments, some of which are critical [3]. The system proposed by Ferman takes a hybrid approach to feature selection, applying feature selection methods to machine learning classifiers [4]. Francisco observes that the proportion of time spent sitting has increased substantially, owing to the use of computers as the main tool for work and leisure and to the growth of jobs with a high office workload [5]. Che Xin points out that, with the spread of IoT technology, more and more devices are connected to the Internet to collect environmental information and physical quantities [7]. These data are real-time, large in scale and diverse, and need to be processed and analyzed by appropriate algorithms [8].
This project aims to use machine learning algorithms to analyze sensor data in order to monitor and predict equipment status [9]. By establishing models of different equipment states and using them to monitor and predict the current equipment state, production line downtime and other losses caused by equipment failures or anomalies can be avoided [10]. In-depth analysis of historical data can also reveal potential problems so that preventive measures can be taken. The project uses a variety of machine learning algorithms to process and analyze sensor data efficiently, and evaluates and compares the accuracy of each model by calculating the MSE, so as to provide more accurate, reliable and efficient equipment condition monitoring and prediction services for enterprises and individuals.
Gupta Swadha studied the role of the SVM algorithm in device state detection [11]; Ma Yuchun used an improved XGBoost to observe and record device state in real time [12]; Xu Huibo used a BP neural network to classify device state into two classes, "normal" and "abnormal" [13]. Building on these studies, this project summarizes the application of machine learning algorithms to device state detection, preprocesses the data, splits the data set, trains several machine learning regression models, calculates evaluation parameters such as MSE, RMSE, MAE, MAPE and R², and compares the performance of each model in order to achieve better monitoring and prediction of equipment status.

Data source description
We use the "Individual household electric power consumption" data set from the UCI Machine Learning Repository. It records the per-minute electricity consumption of a French household from December 2006 to November 2010, a period consistent with the roughly two million one-minute records described below. The household, located in Sceaux near Paris, has a total of 47 measuring points, including various electrical equipment such as electric car chargers, water pumps and ovens.
The purpose of this work is to study and compare the effects of various machine learning algorithms on device status detection. Because this data set has been studied relatively comprehensively, which makes it convenient to collect data and carry out comparative analysis, we chose it as the data set for this paper.

Data parameter description
The dataset contains 2,075,259 records, each of which includes seven attributes: date and time, global active power, global reactive power, current intensity, voltage, active power, and reactive power. Date and time are sampled at one-minute intervals, global active power and global reactive power are measured in kilowatts, current intensity is measured in amperes, voltage is measured in volts, and active power and reactive power are measured in watts.
The goal of this dataset is to support the prediction of future energy demand and the optimization of energy use by analyzing the minute-by-minute power consumption of a household. It can be used for various machine learning tasks, such as regression analysis and time series analysis. Because the data are collected at fine time resolution and on a large scale, the dataset can also be used to test new large-scale data processing technologies and algorithms.

Decision tree
Decision trees build models with a tree structure that can be used to classify data and make predictions. They require relatively little training data, and both training and prediction are fast. Decision tree models are good at capturing nonlinear relationships in a dataset, modeling feature interactions, and dealing with outliers. A decision tree is trained on a set of given samples, and the resulting tree is then used to predict new samples. For the regression training that follows, we adjusted the structure of the decision tree: the minimum number of samples required to split an internal node was set to 10, the maximum depth of the tree to 200, and the maximum number of leaf nodes to 500.
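The role of the minimum-samples-to-split setting can be illustrated with a small sketch of how a regression tree might choose a split on a single feature. This is not the implementation used in this paper: the function names `sse` and `best_split` are hypothetical, and the least-squares split criterion is an assumption chosen for illustration.

```python
from statistics import fmean

MIN_SAMPLES_SPLIT = 10  # value quoted in the text


def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mu = fmean(values)
    return sum((v - mu) ** 2 for v in values)


def best_split(xs, ys, min_samples_split=MIN_SAMPLES_SPLIT):
    """Return (threshold, cost) of the least-squares split on one feature.

    Returns None when the node holds fewer than min_samples_split samples,
    or when all feature values are identical, so the node stays a leaf.
    """
    if len(xs) < min_samples_split:
        return None
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        cost = sse(left) + sse(right)
        if best is None or cost < best[1]:
            best = (threshold, cost)
    return best


# Toy data: the target jumps from 1.0 to 5.0 between x = 5 and x = 6.
toy_x = list(range(12))
toy_y = [1.0] * 6 + [5.0] * 6
split = best_split(toy_x, toy_y)
```

A node with fewer than 10 samples is never split under this setting, which limits how finely the tree can partition the training data.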

Random forest
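Each tree is generated according to the following rules: (1) The training set of each tree is drawn from the original training set by sampling with replacement, so the training sets differ from tree to tree but share duplicate data. (2) If each sample has M feature dimensions, a constant m much smaller than M is specified; at each node, m features are selected at random from the M features, and the best of these m features (the one that maximizes information gain) is used to split the node. During forest growth, the value of m remains unchanged.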
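The two sampling rules can be sketched in a few lines. The helper names `bootstrap_indices` and `candidate_features`, the toy sizes and the fixed seed are all illustrative assumptions, not part of the experimental setup.

```python
import random


def bootstrap_indices(n_samples, rng):
    """Rule (1): draw n_samples row indices with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def candidate_features(n_features, m, rng):
    """Rule (2): pick m distinct feature indices to evaluate at one node."""
    return sorted(rng.sample(range(n_features), m))


rng = random.Random(0)  # fixed seed only so the sketch is repeatable
N_SAMPLES, N_FEATURES, M = 8, 6, 2  # toy sizes, chosen for illustration

# One bootstrap training set per tree; rows may repeat within and across trees.
tree_rows = [bootstrap_indices(N_SAMPLES, rng) for _ in range(3)]

# Feature subset considered at a single node; m stays fixed for the whole forest.
node_features = candidate_features(N_FEATURES, M, rng)
```

Each tree then searches for its best split only among the m candidate features drawn at that node.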
We adjusted the structure of the random forest by increasing the number of decision trees to 100, the maximum depth of each tree to 100, and the maximum number of leaf nodes to 500. Node splits were evaluated with the mean squared error (MSE) criterion.

Logistic regression model
The logistic regression model is a machine learning algorithm for classification problems. It predicts which category the input data belongs to by multiplying the input features by weights, adding a bias term, and then mapping the result to a probability value in the interval [0, 1] via the sigmoid function.
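The prediction step described above can be written as a minimal sketch. The function names, the example weights and the 0.5 decision threshold are illustrative assumptions.

```python
import math


def sigmoid(z):
    """Map a real number to a probability, avoiding overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(features, weights, bias):
    """Weighted sum of the features plus the bias, passed through the sigmoid."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


def predict_label(features, weights, bias, threshold=0.5):
    """Return class 1 when the probability reaches the threshold, else class 0."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0


# Illustrative values: 0.5 * 1.0 + (-0.25) * 2.0 + 0.0 = 0.0, so the probability is 0.5.
p = predict_proba([1.0, 2.0], [0.5, -0.25], 0.0)
```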
Logistic regression models are usually trained by maximum likelihood estimation, which maximizes the probability that the model assigns to the observed labels. During training, the parameters of the model are adjusted iteratively to increase the likelihood function, so that the predictions of the model become more accurate.
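For the binary case, the quantity being maximized can be written out explicitly. The notation below (n samples, labels y_i in {0, 1}, weights w, bias b) is introduced here for illustration and is not taken from the paper.

```latex
\begin{align}
p_i &= \sigma(w^\top x_i + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}} \\
\ell(w, b) &= \sum_{i=1}^{n} \bigl[\, y_i \log p_i + (1 - y_i) \log (1 - p_i) \,\bigr] \\
\frac{\partial \ell}{\partial w} &= \sum_{i=1}^{n} (y_i - p_i)\, x_i, \qquad
\frac{\partial \ell}{\partial b} = \sum_{i=1}^{n} (y_i - p_i)
\end{align}
```

Gradient-based training moves w and b in the direction of these gradients, so samples whose predicted probability is far from the label contribute the largest updates.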
The logistic regression model can handle both binary and multi-class classification problems. In a binary classification problem, the model predicts 0 or 1, indicating which of the two categories the input data belongs to. In a multi-class problem, the model can use a one-vs-rest or one-vs-one scheme, in which each category is compared either against all other categories together or against every other category in turn, and the individual results are combined to obtain the final classification.

XGBoost model
XGBoost (eXtreme Gradient Boosting) is a decision-tree-based ensemble learning algorithm that performs very well on large-scale data sets and is one of the most commonly used algorithms in machine learning competitions. XGBoost is an efficient implementation of GBDT. Unlike GBDT, XGBoost adds regularization terms to the loss function. In addition, because some loss functions are difficult to optimize directly, XGBoost approximates the loss function by its second-order Taylor expansion.
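The regularized objective and its second-order approximation can be stated as follows, using the standard notation of the XGBoost literature rather than notation from this paper: at iteration t a new tree f_t is added, l is the loss, T is the number of leaves of the tree, w_j is the weight of leaf j, and γ and λ are regularization coefficients.

```latex
\begin{align}
\mathrm{Obj}^{(t)} &= \sum_{i=1}^{n} l\bigl(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\bigr) + \Omega(f_t) \\
&\approx \sum_{i=1}^{n} \Bigl[\, l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \Bigr] + \Omega(f_t) \\
g_i &= \frac{\partial\, l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr)}{\partial \hat{y}_i^{(t-1)}}, \qquad
h_i = \frac{\partial^2 l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr)}{\partial \bigl(\hat{y}_i^{(t-1)}\bigr)^2} \\
\Omega(f) &= \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2
\end{align}
```

Because the approximation depends on the loss only through g_i and h_i, the same tree-building procedure can be reused for any twice-differentiable loss.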
The core idea of the XGBoost model is to train weak learners based on decision trees iteratively and then combine them into a strong learner. In each iteration, XGBoost fits a new tree to the gradient information computed from the predictions of the previous iteration, so that the new tree concentrates on the samples that are currently predicted poorly. XGBoost also scales the contribution of each weak learner when the predictions are summed, which improves the accuracy of the overall model.
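The iterative idea can be sketched with plain gradient boosting on a single feature, using one-split trees (stumps) and a squared-error loss. This is a simplified illustration, not XGBoost itself: it omits regularization and second-order information, and the names `fit_stump` and `boost`, the toy data and the learning rate are assumptions.

```python
from statistics import fmean


def fit_stump(xs, residuals):
    """Least-squares stump on one feature; assumes at least two distinct x values."""
    best = None
    for t in sorted(set(xs))[:-1]:  # samples with x <= t go to the left leaf
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = fmean(left), fmean(right)
        cost = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or cost < best[0]:
            best = (cost, t, lv, rv)
    _, t, lv, rv = best
    return lambda x: lv if x <= t else rv


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = fmean(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        # For squared error, the residual is the negative gradient of the loss.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


# Toy data with a single step; the boosted model should recover both levels.
toy_x = [1, 2, 3, 4, 5, 6]
toy_y = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
model = boost(toy_x, toy_y)
```

Each round shrinks the remaining error on the toy data, which is the sense in which later learners focus on what earlier ones predicted poorly.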

Different machine learning regression results
Building on previous studies, this project summarized the application of machine learning algorithms to device state detection, preprocessed the data, split the data set, trained several machine learning regression models on the data, and calculated evaluation parameters for each model, including MSE, RMSE, MAE, MAPE and R². The performance of the models was then compared in order to achieve better monitoring and prediction of equipment status.
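For reference, the five evaluation parameters can be computed as in the sketch below, using their standard definitions. The function name `regression_metrics` and the sample values are illustrative assumptions and are not results from this study.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE, MAPE (in percent) and R² for paired values.

    MAPE is undefined when any true value is zero, and R² is undefined
    when all true values are identical; this sketch does not handle either case.
    """
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mse = ss_res / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(e) for e in errors) / n,
        "mape": 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }


# Illustrative values only.
metrics = regression_metrics([2.0, 4.0, 6.0, 8.0], [2.5, 3.5, 6.5, 7.5])
```

Lower MSE, RMSE, MAE and MAPE indicate a better fit, while R² closer to 1 indicates that the model explains more of the variance in the target.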

Conclusion
Building on previous studies, this project summarized the application of machine learning algorithms to device state detection, compared their behavior on sensor device data, and evaluated each regression model using MSE, RMSE, MAE, MAPE and R². By analyzing a large volume of historical data, models of different equipment states can be established and used to monitor and predict the current equipment state, which helps avoid production line downtime and other losses caused by equipment failures or abnormalities. The aim is to process and analyze sensor data efficiently and to provide more accurate, reliable and efficient equipment condition monitoring and forecasting services for enterprises and individuals.
A decision tree is a classification and regression algorithm based on a tree structure. Random forest is an ensemble learning algorithm composed of multiple decision trees, with high accuracy and good generalization ability. XGBoost is a gradient boosting tree algorithm that can handle high-dimensional sparse data and missing values. Linear regression fits a linear relationship and is suited to continuous data. KNN is an algorithm based on distance measurement that is suitable for small data sets. The BP neural network is a neural-network-based algorithm that can handle nonlinear problems. ExtraTrees is a random forest variant capable of handling both high-dimensional and noisy data. Model selection should take into account the data type, the problem type, algorithm efficiency and other factors. In terms of results, the XGBoost regression model performed best, with good MSE, RMSE, MAE, MAPE and R² values, followed by the decision tree, random forest, ExtraTrees and KNN regression models. The linear regression and BP neural network regression models performed worst, which suggests that these two models perform poorly when the number of data samples is relatively small.

Figure 1. Data set presentation. (Photo credit: Original)
Random forests can analyze data of very high dimensions without reducing dimensionality. To classify a sample, each decision tree in the forest first classifies it; these weak classifiers then vote, and together they form a strong classifier.

Table 1. Comparison and analysis of evaluation indexes of each algorithm.