Student Network Behavior Analysis and Relevance Research Based on Optimal Decision Tree Algorithms

Based on the optimal decision tree algorithm, this paper proposes a student network behavior analysis model. By training the model on, and predicting, the academic performance of Beihang Grade 2013 undergraduates, it is found that students' online behavior has a profound impact on their academic performance. To maintain a good learning state, students should strictly limit the time they spend on the Internet in their idle time, keep their total time online under control, and ensure that their daily sleep is not affected by Internet use. The regularity of students' daily life is positively correlated with their achievement ranking.


Introduction
Contemporary college students think actively and hold a rich variety of ideas, and they need close attention from student management staff to help them grow [1]. People's thoughts and psychological states are inevitably reflected in their behavior. Most of this behavior can be described and represented by data, forming big data on people's thoughts and behaviors that carries vivid characteristics of the times and rich educational value [2]. Measuring, evaluating and improving the quality of ideological and political work requires precision, and the methods for achieving it depend on big data analysis, that is, on big data about education and about students. Classified research on student behavior data is an important way to describe students' learning, life and ideological dynamics. The continuous accumulation of data reflecting students' behavior makes it possible to use data mining technology to help students better understand themselves.
Academic performance is an important index for evaluating the quality of university learning. However, no clear correlation between students' academic performance and their behavior has yet been established. Based on the decision tree algorithm, this paper proposes a student network behavior analysis model. The model is trained on, and used to predict, the academic performance of all undergraduates of Beihang Class 2013, and it is revised continually. Based on the model, important behavioral characteristics of students are identified and studied, and some general conclusions are drawn.

Analysis of students' network behavior
Forecasting is the core value of big data. Big data analysis of students therefore studies content closely related to students' education and training, including early warning of academic achievement and learning status, focus analysis of students' social hotspots, monitoring and dynamic prediction of network public opinion, tracking analysis of students' online behavior, capture and evaluation of students' daily life status, and automatic mapping and diagnosis of students' psychological status [2]. Most of the data sources involved in these studies can be obtained from the Internet.

Decision tree algorithms
Data mining is essentially a process of knowledge discovery in databases. Generally, it refers to the process of discovering and extracting rules hidden in a large amount of complex data which have practical significance and learning value.
The decision tree method is a classification algorithm, and one of the most intuitive and effective ones. The tree first chooses an attribute to establish the root node and then grows a complete tree step by step. In each step, the attribute with the maximum information gain, that is, the one whose split reduces entropy the most, is selected as the next branch node. The tree is complete when all attributes have been used as branch nodes or when all samples reaching a node belong to the same class. Once the tree is built, each leaf node carries a specific class label, and the path from the root node to a leaf node describes a specific classification process. At present, the ID3, C4.5 and XGBoost algorithms are widely used decision tree algorithms [3].

Development of student network behavior analysis model based on optimal decision tree algorithms
Taking four years of network behavior data of 3199 students of Beihang Class 2013 as samples, this paper uses a decision tree algorithm to train on and predict the average score. 80% of the samples are used for training and 20% for prediction. An error analysis is carried out between the predicted data and the statistical data. This paper uses "the difference between real value and predicted value" as the evaluation index of accuracy, and finds that 92% of predictions fall within 8 points of the real value. Among the algorithms tried so far, the XGBoost decision tree algorithm gives the best results. The flow chart is shown in Figure 1. The difference between the real value and the predicted value is shown in Figure 2. The LightGBM algorithm and the XGBoost algorithm are compared in Figure 3.
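The split and the evaluation index described above can be sketched in a few lines of Python. This is a minimal sketch, not the paper's code: the function names, the fixed random seed, the treatment of "within 8 points" as an inclusive bound, and all sample values are assumptions made for illustration.

```python
import random

def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle rows and split them into training and prediction subsets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def accuracy_within(y_true, y_pred, tolerance=8.0):
    """Fraction of samples whose absolute error is at most `tolerance`."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) <= tolerance)
    return hits / len(y_true)

# Hypothetical student indices standing in for the 3199 records.
students = list(range(3199))
train, test = train_test_split(students)
print(len(train), len(test))

# Hypothetical real and predicted average scores, not the paper's data.
y_true = [82.0, 75.5, 90.0, 61.0]
y_pred = [79.0, 85.0, 88.5, 69.0]
print(accuracy_within(y_true, y_pred))
```

Under this definition, the reported figure of 92% would correspond to `accuracy_within` returning 0.92 on the prediction subset.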
The data of students' surfing the internet is analysed and characterized. A total of 32 features are designed, of which the typical features are shown in Table 1.
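The model is an additive ensemble of $K$ trees:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F},$$

where $\mathcal{F} = \{ f(x) = w_{q(x)} \}$ is a function space representing a decision tree, $q$ maps a sample to a leaf, $w$ is the vector of leaf weights, and $T$ represents the number of leaf nodes of a decision tree. The loss function is:

$$L = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k), \quad \Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2.$$

The goal is to minimize the loss function when the $t$-th tree $f_t$ is added:

$$L^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t).$$

The Taylor expansion is used to approximate the loss function:

$$L^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t),$$

where $g_i$ and $h_i$ are the first and second derivatives of $l$ with respect to $\hat{y}_i^{(t-1)}$. Dropping the constant term and grouping the samples by leaf, with $I_j = \{ i \mid q(x_i) = j \}$, the target function becomes:

$$\tilde{L}^{(t)} = \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^2 \right] + \gamma T.$$

Because $q(x)$ has a fixed structure, the optimal weight $w_j^*$ of leaf $j$ can be calculated as:

$$w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}.$$

Substituted into the objective function:

$$\tilde{L}^{(t)}(q) = -\frac{1}{2} \sum_{j=1}^{T} \frac{\left( \sum_{i \in I_j} g_i \right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T. \tag{6}$$

Equation (6) can be used to calculate the score of a tree structure $q$. Generally, it is impossible to enumerate all possible tree structures $q$. Instead, a greedy algorithm is used that tries adding one split to the existing leaves at a time. Suppose that $I_L$ and $I_R$ are the sample sets of the left and right nodes after the split. For $I = I_L \cup I_R$, the loss reduction after the split is given by Equation (7):

$$L_{split} = \frac{1}{2} \left[ \frac{\left( \sum_{i \in I_L} g_i \right)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{\left( \sum_{i \in I_R} g_i \right)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{\left( \sum_{i \in I} g_i \right)^2}{\sum_{i \in I} h_i + \lambda} \right] - \gamma. \tag{7}$$

When searching for the best split points, enumerating all possible split points of each feature, as the traditional greedy method does, is inefficient, so XGBoost implements an approximate algorithm. According to the percentile method, several candidate split points are proposed, and the best split point among them is then calculated from the above formula [6]. XGBoost also takes into account that the training data is sparse. It can specify a default branch direction for missing or specified values, which can greatly improve the efficiency of the algorithm, by up to 50 times [7,8].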
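The leaf weight and the split score of Equation (7) can be made concrete with a small exact greedy scan over one feature. The following is a minimal sketch rather than XGBoost's implementation: the function names are hypothetical, and the example assumes a squared error loss $l = \frac{1}{2}(y - \hat{y})^2$, so that $g_i = \hat{y}_i - y_i$ and $h_i = 1$.

```python
def leaf_weight(G, H, lam):
    """Optimal leaf weight w* = -G / (H + lambda)."""
    return -G / (H + lam)

def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    """Loss reduction of a split, following Equation (7)."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - gamma

def best_split(x, g, h, lam=1.0, gamma=0.0):
    """Exact greedy scan of one feature; returns (gain, threshold) or None."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    G, H = sum(g), sum(h)
    G_L = H_L = 0.0
    best = None
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        G_L += g[i]
        H_L += h[i]
        if x[i] == x[nxt]:
            continue  # no threshold fits between two equal values
        gain = split_gain(G_L, H_L, G - G_L, H - H_L, lam, gamma)
        if best is None or gain > best[0]:
            best = (gain, (x[i] + x[nxt]) / 2)
    return best

# Hypothetical data: targets y = [1, 1, 3, 3], initial prediction 0,
# squared error loss, so g = prediction - y and h = 1.
x = [1.0, 2.0, 3.0, 4.0]
g = [-1.0, -1.0, -3.0, -3.0]
h = [1.0, 1.0, 1.0, 1.0]
gain, threshold = best_split(x, g, h)
print(gain, threshold)
print(leaf_weight(-2.0, 2.0, 1.0), leaf_weight(-6.0, 2.0, 1.0))
```

The scan keeps running sums of the gradients on the left side, so each candidate threshold costs constant time once the feature is sorted. The approximate algorithm would evaluate the same score only at percentile candidates instead of at every position.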
The sorted feature columns are stored in memory in block form and can be reused across iterations. Although the boosting iterations themselves must run serially, the feature columns can be processed in parallel. Storing the data by feature column speeds up the search for the best split points. However, because gradient data is fetched by row index, this layout leads to non-contiguous memory access [9,10].
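The idea of sorting each feature column once and reusing the result can be illustrated as follows. This is a simplified sketch with hypothetical function names and data; it keeps only sorted row indices per column and does not reproduce XGBoost's actual block format or its parallel execution.

```python
def build_column_blocks(rows):
    """For each feature column, store the row indices sorted by feature value.

    The sort is done once and the result is reused in every iteration.
    """
    n_features = len(rows[0])
    return [sorted(range(len(rows)), key=lambda i: rows[i][j])
            for j in range(n_features)]

def prefix_gradient_sums(block, g):
    """Accumulate gradients in sorted feature order.

    Reading g[i] in this order jumps between rows, which is the
    non-contiguous memory access pattern mentioned in the text.
    """
    sums, total = [], 0.0
    for i in block:
        total += g[i]
        sums.append(total)
    return sums

# Hypothetical data: three samples with two features each.
rows = [[3.0, 10.0], [1.0, 30.0], [2.0, 20.0]]
g = [0.5, -1.0, 0.25]  # hypothetical per-row gradients
blocks = build_column_blocks(rows)
print(blocks)
print(prefix_gradient_sums(blocks[0], g))
```

Each entry of `blocks` depends on one column only, so the columns could be handled by separate workers, while the gradients change from one iteration to the next and must be looked up again by row.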

A discussion on the relevance between students' network behavior and achievement
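If a decision tree is too complex, it is difficult to understand, so the tree should be pruned. Each internal node of a decision tree, starting from the root node, holds a split criterion, and each leaf node holds an output value. The importance of a feature in the XGBoost model is represented by its average information gain. The calculation formulas are as follows:

$$\mathrm{Ent}(D) = -\sum_{k} p_k \log_2 p_k,$$

$$\mathrm{Gain}(D, a) = \mathrm{Ent}(D) - \sum_{v=1}^{V} \frac{|D_v|}{|D|} \mathrm{Ent}(D_v),$$

where $D$ is the data set, $p_k$ is the proportion of samples in $D$ that belong to class $k$, and $a$ is the selected attribute, which has $V$ values. The data set $D$ is divided by these $V$ values into the data sets $D_1$ to $D_V$. The information entropy of the $V$ data sets is calculated and their weighted average is obtained; the gain is the amount by which this weighted average falls below the entropy of $D$.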
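The entropy and information gain calculation described above can be sketched as follows. The attribute values and class labels are hypothetical and only illustrate the arithmetic; since the model in this paper predicts a numeric average score, treating the target as discrete classes here is an assumption made for the example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Information entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy of D minus the weighted average entropy of D_1 to D_V."""
    n = len(labels)
    subsets = {}
    for v, y in zip(values, labels):
        subsets.setdefault(v, []).append(y)  # one subset per attribute value
    weighted = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted

# Hypothetical attribute (login habit) and class labels (score band).
labels = ["low", "low", "high", "high"]
informative = ["late", "late", "early", "early"]
uninformative = ["a", "b", "a", "b"]
print(information_gain(informative, labels))
print(information_gain(uninformative, labels))
```

An attribute that separates the classes perfectly recovers the full entropy of the data set as gain, while an attribute whose subsets are as mixed as the original data set yields no gain.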
By analyzing the network behavior, the 5 most important characteristic parameters affecting students' performance, together with their importance ranking, can be obtained, as shown in Table 2. Staying up late to surf the Internet disturbs students' normal work and rest. According to statistics, as many as 80% of students stay up late surfing the Internet to varying degrees; only a small number do so for study, while most do so for entertainment, social needs and so on. Smartphone use has become quite common among college students, and the majority of them use mobile phones to go online in class. Of course, many of these students are looking up content related to the class. But if they spend too much time online, they miss key points and details of the lesson, which inevitably harms their learning and is reflected in their academic achievements.
Analyzing the correlation between students' online behavior characteristics and their academic performance shows that the more bytes students use from Monday to Friday, the lower their scores rank; the higher the total online fees they consume, the lower their grades rank; the earlier they log on, the lower their grades rank; the more often they stay up late, the lower their grades rank; and the more bytes they use on the Internet, the lower their grades rank. The analysis results are shown in Figures 4, 5, 6, 7 and 8, respectively. It can be concluded that students' online behavior has a profound impact on their academic performance. In order to maintain a good learning state and ensure a better learning effect, students should strictly limit the time they spend online in their free time, keep their total time online under control, and ensure that their daily sleep is not affected by their Internet use.
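A relationship of this kind between one behavioral feature and academic performance can be quantified with a correlation coefficient. The text does not state which measure was used, so the Pearson coefficient below is an assumption, and the feature name and all values are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical weekday traffic (GB) and average scores, not the paper's data.
weekday_gb = [1.0, 2.0, 3.0, 4.0, 5.0]
avg_score = [90.0, 85.0, 80.0, 75.0, 70.0]
r = pearson(weekday_gb, avg_score)
print(r)
```

A coefficient close to -1 indicates that higher traffic goes with lower scores. A coefficient alone describes association and does not show that the online behavior causes the lower scores.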

Conclusion
(1) Based on the decision tree algorithm, this paper proposes a model for student network behavior analysis that combines 32 characteristics of network behavior. The model is trained on, and used to predict, the four-year average academic performance of 3199 students of Beihang Class 2013, and it is revised continually. Based on the model, the important behavioral characteristics of students are identified and studied, and some general conclusions are obtained.
(2) ... the average number of bytes used on the Internet every day are the 5 most important factors affecting students' academic performance. Therefore, this paper analyses the correlation between these 5 characteristics and academic performance one by one.
(3) Analysis shows that students' online behavior has a profound impact on their academic performance. In order to maintain a good learning state and ensure better learning results, students should strictly limit the time they spend on the Internet in their idle time, keep their total time online under control, and ensure that their daily sleep is not affected by their online behavior.