The Computer Intelligent Selection of Scientific Research Subjects Through Ensemble Learning for Large-Scale Data Sources and Deep Neural Network

Selecting a proper scientific research subject is critical for scientific researchers and managers. Scientific research data come from massive sources and have various attributes. For the problem of subject selection, feature extraction and the prediction model play important roles in performance optimization. In this paper, we introduce an ensemble learning method to help find the attributes that best describe the data. Our ensemble learning models include random forests, support vector machine, Boltzmann machine and decision tree. Since the data come from many data sources, we adopt multiple deep neural network models. An acceleration method is also used to reduce the training time. Experiments show that the proposed approach performs better than the RNN algorithm in both accuracy rate and recall rate. The model selection module and the acceleration method help reduce the time cost considerably.


Introduction
A scientific research subject reflects the theoretical awareness and practical tools that researchers bring to an issue. The selection of scientific research subjects embodies the thinking level, theoretical depth and practical ability of the selector. Whether the selection is right matters for both the subject application and the subsequent research. The critical question in selecting scientific subjects is whether they are important and advanced.
At present, the selection of scientific research subjects is a new application demand in the domain of scientific information. Scientific researchers and project managers should be proactive in selecting scientific research subjects. The most similar works are in the book industry, where techniques such as time series analysis, neural networks, collaborative filtering and personalized recommendation are applied. For the selection of scientific research subjects itself, most works focus on document retrieval tools.
Scientific and technological data come from different data sources, and each data source has a massive number of data attributes. The data for selecting subjects therefore have high dimensions and large volumes. It is hard to achieve accurate feature extraction and subject prediction for large-scale data sources with only one kind of technique. Moreover, the features of selected subjects change greatly with time. When the change is large and sudden, the false positive rate increases rapidly. Therefore, in this paper we propose a Selection method based on Ensemble Learning and multiple deep-learning network structures (SEL) to solve the above problems.

Related Work
The selection of scientific research subjects becomes more and more important with the growing emphasis on scientific and technological innovation. The book industry has the most similar works. The operating process of selecting book topics includes information choice, topic design, topic assessment and optimization. To handle short-term fluctuation and long-term periodicity, data mining and web mining are introduced.
A time series analysis model was first used for selecting book topics [1]; it is based on book popularity data and constructs a series of mathematical models. An improved neural network model is used in [2] according to the characteristics of short-term fluctuation, which provides a reliable guarantee for the profits of publishing units. To address the problems of reliance on subjective experience and missed hotspots, a collaborative filtering algorithm is introduced into the selection system of publishing topics.
Many other works [4][5][6] also explore applications of subject selection. They help improve the management quality and efficiency of enterprises and universities. However, these works focus on function design, development platforms and system frameworks, so system optimization and selection effectiveness remain weak. The works in [7][8][9] study personalized recommendation technology and improve the state of subject selection.
Universities, scientific institutes and intelligence agencies may all face the problem of selecting scientific research subjects, which is the basis and starting point of scientific research and think-tank decisions. For selecting scientific research subjects, most research is based on expert experience and takes a management perspective. The work in [10] enumerates dozens of selection cases of scientific research subjects over ten years and summarizes several patterns and characteristics. A scientific document retrieval system is used as a tool for subject selection in [11]. It helps select proper subjects by avoiding applications for similar projects and by helping determine research ideas.

The Framework of Selecting Scientific Research Subjects
To solve the above problems, we propose a selection method for scientific research subjects based on ensemble learning, which can accurately obtain the fine-grained subjects of a future period. The method is suited to large-scale data sources. The ensemble learning model is used to extract the features, and a deep-learning technique is adopted to predict the hotspot words. We also introduce a model selection module to optimize the proposed model. Our framework for selecting scientific research subjects consists of six modules, as shown in Fig. 1: data crawler, feature representation, feature extraction, model training, model selection and prediction. The data crawler module collects various kinds of data from different data sources and performs the initial pre-processing operations. The feature representation module provides a representation method for scientific and technological texts, which are periodical. The feature extraction module uses the ensemble learning model to extract the critical features in a period, and its output is the input of the model training and prediction modules. The model training module trains the selection model based on multiple deep neural networks and generates the fine-grained hotspot words. The model selection module adopts an efficiency ratio algorithm to discard badly performing models, that is, models that give inaccurate prediction results or consume too many resources. The prediction module selects the scientific subjects based on the trained models and multiple data sources.

Data Crawler
This module uses web crawler technology to obtain scientific and technological information from technology news websites and scientific and technical document databases. We denote the set of data collected in a period as $T_t$, where $t$ is the index of the period.

Feature Representation
The weighted TF-IDF algorithm is used to obtain the keyword vectors of $T_t$, which are denoted as
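As an illustration of how keyword vectors can be obtained from the texts of one period, the following is a minimal sketch of TF-IDF with an optional per-term weight. The weighting scheme used in the paper is not specified in this text, so the `weights` argument, the function name `tfidf_vectors` and the toy documents are assumptions made only for this example.

```python
import math
from collections import Counter


def tfidf_vectors(docs, weights=None):
    """Return one {term: score} dict per tokenized document.

    `weights` is a hypothetical per-term multiplier (default 1.0); it only
    stands in for the paper's weighting scheme, which is not given here.
    """
    weights = weights or {}
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vec = {}
        for term, count in counts.items():
            tf = count / total                  # term frequency in this document
            idf = math.log(n_docs / df[term])   # inverse document frequency
            vec[term] = weights.get(term, 1.0) * tf * idf
        vectors.append(vec)
    return vectors


# Toy texts for one period (illustrative only).
docs = [
    ["deep", "learning", "hydrology"],
    ["deep", "learning", "optics"],
    ["flood", "hydrology", "hydrology"],
]
vectors = tfidf_vectors(docs, weights={"flood": 2.0})
```

Terms that occur in every document receive a score of zero under this plain IDF definition, which is one reason smoothed variants are often preferred in practice.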

Feature Extraction
Since the training data come from a wide range of sources, the prediction inputs have high dimensions. Without selecting or deleting attributes, the curse of dimensionality arises. Moreover, some data have nothing to do with a given domain. For example, subject selection in water conservancy is only weakly related to media data in physics. Therefore, only strongly related data are needed for a given domain. Without expert experience, it is difficult to choose the data attributes with strong relationships. This module uses an ensemble learning algorithm to extract features, and its output is the input of the model training and prediction modules. The ensemble learning model is based on algorithms including decision tree, random forests, support vector machine and deep Boltzmann machine. The module finally chooses the best result among all the algorithms. In this paper we give the steps of the deep Boltzmann machine. Its structure and parameters are set as follows.
1) The deep Boltzmann machine adopts a three-layer restricted Boltzmann machine structure.
2) The first layer is the visible unit layer, which is a $Q \times B$ binary matrix, denoted as
For a model, we first delete the inaccurate words from the test data. Let err denote the prediction error. The algorithm deletes the $n$ words with the largest err values.
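The deletion of the n words with the largest err values can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name `prune_worst_words` and the example words and error values are hypothetical.

```python
def prune_worst_words(word_errors, n):
    """Drop the n words with the largest prediction error.

    word_errors maps each word to its err value (illustrative data layout).
    """
    # Rank words from largest to smallest error.
    ranked = sorted(word_errors, key=word_errors.get, reverse=True)
    dropped = set(ranked[:n])
    return {w: e for w, e in word_errors.items() if w not in dropped}


# Hypothetical words and error values, for illustration only.
word_errors = {
    "graphene": 0.05,
    "blockchain": 0.40,
    "perovskite": 0.10,
    "metaverse": 0.55,
}
kept = prune_worst_words(word_errors, n=2)
```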
For words with small err values, we generate new fine-grained hotspot keywords, which take part in the subsequent training. The new hotspot keywords are computed as follows.
We first compute the parameter which is computed Here

Model Selection
For model evaluation, we use two indicators: the average error err and the performance ratio. The performance ratio is denoted as $s$ and is computed as $s = err / t$, where $t$ is the training time. To prevent the over-inflation of model structure, we introduce the parameter $t$.
When the err value of a model is smaller than a given threshold, the model is labelled as an A model, and its fine-grained keywords are deleted from $W$. When err is larger than the threshold and $s$ ranks in the top 80 percent, the model is labelled as a B model. When err is larger than the threshold and $s$ ranks in the last 20 percent, the model is labelled as a C model. The algorithm directly deletes the C models. We choose all the A models as training models. For B models, the algorithm randomly selects 50 percent of them.
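The labelling rule above can be sketched as follows. This is a minimal illustration under stated assumptions: the performance ratio is taken as s = err / t, "top 80 percent" is read as the largest s values, a model whose err equals the threshold is treated as not being an A model, and the function name `select_models`, the seed and the example numbers are hypothetical.

```python
import random


def select_models(models, err_threshold, seed=0):
    """Label models as A, B or C and return the models kept for training.

    models: list of (name, err, train_time) tuples (illustrative layout).
    """
    rng = random.Random(seed)
    labels = {}
    a_models, rest = [], []
    for name, err, train_time in models:
        if err < err_threshold:
            labels[name] = "A"
            a_models.append(name)
        else:
            # Performance ratio s = err / t (assumed reading of the formula).
            rest.append((name, err / train_time))
    # Assumption: "top" means the largest s values.
    rest.sort(key=lambda item: item[1], reverse=True)
    n_b = (len(rest) * 4) // 5  # top 80 percent become B models
    b_models = [name for name, _ in rest[:n_b]]
    for name in b_models:
        labels[name] = "B"
    for name, _ in rest[n_b:]:
        labels[name] = "C"  # C models are discarded
    # Keep every A model and a random 50 percent of the B models.
    chosen_b = rng.sample(b_models, len(b_models) // 2)
    return a_models + chosen_b, labels


# Hypothetical models: (name, err, training time).
models = [
    ("m1", 0.02, 10.0),
    ("m2", 0.30, 5.0),
    ("m3", 0.25, 50.0),
    ("m4", 0.40, 8.0),
    ("m5", 0.45, 5.0),
    ("m6", 0.20, 100.0),
]
selected, labels = select_models(models, err_threshold=0.05)
```

In this toy run, m1 is the only A model, m6 has the smallest ratio and becomes the C model, and two of the four B models are kept at random.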

Prediction
This module predicts the selected subjects based on the trained models and multiple data sources. Our prediction method is as follows.
3) If the number of selected models is massive, comparing the results of all the models is time-consuming. In this paper we propose an acceleration method. We compute the correlation coefficient between

Performance Evaluation
In this paper, we propose a Selection approach for scientific research subjects based on Ensemble Learning (SEL). The algorithm includes six modules: data crawler, feature representation, feature extraction, model training, model selection and prediction. To improve prediction efficiency, we introduce the model selection module; we call the method without this module SEL-NS. In the prediction module, we introduce an acceleration method; we call the algorithm without it SEL-NA. The algorithm with neither the model selection module nor the acceleration method is called SEL-NSNA.
In this section, we evaluate the performance of SEL, SEL-NS, SEL-NA and SEL-NSNA. The comparison indicators are recall rate, accuracy rate and training time. As the benchmark, we use the Recurrent Neural Network (RNN) algorithm [12]. Fig. 2 shows the comparison of recall rate and accuracy rate between the proposed SEL algorithm and the RNN algorithm, tested against the number of hotspot keywords. For both recall rate and accuracy rate, SEL performs better than the RNN algorithm. This is because SEL uses multiple deep learning models to predict the selected subjects, whereas a single RNN structure cannot suit all kinds of fine-grained domains. SEL performs 65.1% better than RNN in recall rate and 1.3 times better than RNN in accuracy rate.
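For reference, the two quality indicators can be computed on keyword sets as in the sketch below. The exact definitions used in the experiments are not given in this text, so this example assumes that recall rate is the share of true hotspot keywords that were predicted and that accuracy rate is the share of predicted keywords that are true hotspots. The function name and the toy keyword lists are hypothetical and are not experimental data.

```python
def recall_and_accuracy(predicted, actual):
    """Return (recall rate, accuracy rate) for predicted hotspot keywords.

    Assumed definitions: recall = hits / number of true keywords,
    accuracy = hits / number of predicted keywords.
    """
    predicted, actual = set(predicted), set(actual)
    hits = len(predicted & actual)
    recall = hits / len(actual) if actual else 0.0
    accuracy = hits / len(predicted) if predicted else 0.0
    return recall, accuracy


# Toy keyword lists, for illustration only.
predicted = ["flood", "reservoir", "optics", "drought"]
actual = ["reservoir", "drought", "irrigation"]
recall, accuracy = recall_and_accuracy(predicted, actual)
```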

Conclusion
To address the problems caused by large-scale data sources, we propose a selection approach for scientific research subjects based on ensemble learning and deep learning techniques. Since the data attributes are massive and variable, we adopt an ensemble learning method including random forests, support vector machine, Boltzmann machine and decision tree to perform feature extraction. Since different fine-grained domains have distinct hotspot keywords, we generate multiple deep neural network models, and the most suitable model is chosen to do the prediction. We also improve the running efficiency of the models through an acceleration method. Our extensive experiments show that the proposed algorithm performs 65.1% better than RNN in recall rate and 1.3 times better than RNN in accuracy rate.