A Prediction and Push Method of Scientific Researching Issue Based on RNN

Exploring and predicting hot scientific research issues has been a focus of recent sci-tech information research. In recent years, the numbers of scholars and academic publications have increased steadily, so it is hard to track and handle the developing trends of scientific research hotspots manually. The most commonly used methods in the past are simple statistics on keywords and high-frequency terms. Most of these methods cost a lot of time and manpower and ignore the relevance between different words. In this paper we use an RNN to design a prediction and push system that can perceive potential research hotspots for some time to come. Hot issues are generated based on the relevance among sci-tech words. The system finds the technical applications relevant to the hotspots and recommends them to academic researchers. Extensive experiments demonstrate that the proposed approach has a higher accuracy rate and a lower false positive ratio, and that it produces better forward-looking forecasts than word frequency statistics.


Introduction
Sci-tech information plays an important role in the establishment and implementation of strategies and plans for nations, society and enterprises [1] [2]. The prediction of scientific research hotspots is a new application in the sci-tech information area. Science researchers and scientific project managers must have forward-looking ideas on topic choice [3]. They have to consider the current state of scientific research and social development, and make a judgment about future novel theories or potentially applicable technologies.
At present, the prediction of scientific research hotspots relies heavily on senior professional officers consulting the literature and conducting market research [4]. In addition, when a new theory or technology emerges, exploring its related applications takes a lot of work. Therefore, it is imperative to design a prediction and push system for scientific research issues. Such a system estimates the hot research topics of the near future and pushes them to scientific users, which supports the work of science researchers and scientific project managers.

The Prediction and Push Framework
To solve the problem described in Section 1, we propose a prediction and push method for scientific research issues based on an RNN (Recurrent Neural Network). A recurrent neural network is a kind of ANN (Artificial Neural Network) [5]. It can model the internal dependencies among sequential data and is widely used in natural language processing, image annotation and machine translation. Figure 1 shows the framework of our system, which obtains in a timely manner the hot topics for some time ahead. It consists of five modules: the data crawler module, the feature representation module, the feature extraction module, the model training module, and the prediction and recommendation module.
Next we give the whole flow for generating scientific research hotspots, as shown in Figure 2. Our algorithm is divided into two parts: model training, and hotspot prediction and recommendation. When processing features and training the model, it first crawls sci-tech articles periodically, extracts TF-IDF [6] vectors for the articles, and then performs feature extraction on the vectors of each period with a deep Boltzmann machine. It then trains the RNN-based hotspot predictor to obtain the prediction model. When predicting and pushing hot issues, it again crawls sci-tech articles periodically, extracts TF-IDF vectors for the articles, and then performs RNN-based hotspot prediction with the prediction model. The model outputs the keywords of the scientific research hotspots. By combining these keywords with the relevance of the clustered research words in the sci-tech corpus, a series of hot scientific research issues is generated and recommended to users.
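As a rough illustration of the RNN-based prediction step in this flow, the following minimal sketch runs a plain Elman-style recurrence over the per-period feature vectors and turns the final hidden state into one score per keyword. It is not the paper's improved model: the function name `rnn_keyword_scores`, the tanh and sigmoid activations, the single output layer and the toy weights are all assumptions made for illustration. The names U, W and V follow the input, cyclic and output weights that the paper introduces for its model.

```python
import math


def rnn_keyword_scores(inputs, U, W, V):
    """Score keywords from a sequence of per-period feature vectors.

    inputs: list of feature vectors, one per period (oldest first)
    U: input weights   (hidden_size x input_size)
    W: cyclic weights  (hidden_size x hidden_size)
    V: output weights  (num_keywords x hidden_size)
    All activations are assumptions, not taken from the paper.
    """
    hidden = [0.0] * len(W)  # initial hidden state
    for x in inputs:
        # Weighted input of the recurrent units: U * x + W * previous hidden state
        net = [
            sum(U[i][k] * x[k] for k in range(len(x)))
            + sum(W[i][k] * hidden[k] for k in range(len(hidden)))
            for i in range(len(W))
        ]
        hidden = [math.tanh(v) for v in net]
    # One sigmoid score per keyword, read from the last hidden state
    logits = [
        sum(V[j][i] * hidden[i] for i in range(len(hidden)))
        for j in range(len(V))
    ]
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


if __name__ == "__main__":
    # Toy weights: two input features, two hidden units, three keywords
    U = [[1.0, 0.0], [0.0, 1.0]]
    W = [[0.1, 0.0], [0.0, 0.1]]
    V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    periods = [[0.2, 0.1], [0.4, 0.1], [0.9, 0.0]]
    print(rnn_keyword_scores(periods, U, W, V))
```

Keywords whose score exceeds a chosen threshold would then be treated as predicted hotspot keywords; the threshold itself is a tuning choice that the paper does not specify.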

The Functional Design of Modules
In this section we introduce each module in detail. Each module has its own function, and we describe its implementation process.
Data Crawler Module. The data crawler module uses web crawler technology to crawl scientific research articles from sci-tech websites and bibliographic databases. The crawled articles are usually in text form. The set of articles crawled during one time interval is denoted $T_t$, where $t$ is the sequence number of the period.
Feature Representation Module. The feature representation module provides a way to represent the textual features of sci-tech information and literature; its output is the input of the feature extraction module.
The keyword vectors of $T_t$ are obtained with a weighted TF-IDF algorithm. Let the download or read count of article $t_j$ be $n_j$ and its citation count be $m_j$. The weight of each keyword is then computed from these quantities, where $\bar{n}$ and $\bar{m}$ are the averages of $n_j$ and $m_j$ over all articles.
Feature Extraction Module. The feature extraction module uses a deep Boltzmann machine [7] to extract the textual features of a period, which provides the input for the model training module and the prediction and recommendation module.
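To make the weighted TF-IDF representation of the feature representation module more concrete, here is a minimal sketch. The paper's exact weighting formula is not reproduced here, so the article weight used below, `1 + 0.5 * n_j / mean_n + 0.5 * m_j / mean_m`, is a purely hypothetical choice that only captures the stated idea that read counts and citation counts are compared with their averages. The function name `weighted_tfidf` and the smoothed IDF are also assumptions.

```python
import math
from collections import Counter


def weighted_tfidf(docs, reads, cites):
    """Return one {term: weight} vector per article.

    docs:  list of token lists, one per article in the period
    reads: download/read count n_j of each article
    cites: citation count m_j of each article
    """
    n_docs = len(docs)
    # Averages over all articles; fall back to 1.0 to avoid dividing by zero
    mean_reads = sum(reads) / n_docs or 1.0
    mean_cites = sum(cites) / n_docs or 1.0
    # Number of articles that contain each term
    doc_freq = Counter(term for doc in docs for term in set(doc))

    vectors = []
    for doc, n_j, m_j in zip(docs, reads, cites):
        # Hypothetical weight: articles read or cited more than average count more
        article_weight = 1.0 + 0.5 * (n_j / mean_reads) + 0.5 * (m_j / mean_cites)
        vec = {}
        for term, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log(n_docs / doc_freq[term]) + 1.0  # smoothed IDF
            vec[term] = article_weight * tf * idf
        vectors.append(vec)
    return vectors


if __name__ == "__main__":
    docs = [["rnn", "rnn", "text"], ["text", "graph"]]
    for vec in weighted_tfidf(docs, reads=[30, 10], cites=[4, 2]):
        print(vec)
```

Any monotone function of $n_j / \bar{n}$ and $m_j / \bar{m}$ could replace the hypothetical article weight without changing the rest of the sketch.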
The structure and parameters of the deep Boltzmann machine are set as follows:
1) The deep Boltzmann machine uses three layers, as shown in Figure 3.
Figure 3. The structure of the deep Boltzmann machine (layer labels: hidden layer 1, hidden layer 2).
2) The first layer is the visible unit layer. The visible layer is a Q B [...]
5) The deep Boltzmann machine uses restricted Boltzmann machines [8] for layer-wise training.
6) For period $t$, the output of the deep Boltzmann machine is denoted $X_t$.
Model Training Module. The model training module trains the model that predicts the potential research hotspots some time in the future, which provides the prediction model for the prediction and recommendation module. Our hotspot prediction model is an improvement on the RNN approach. Its structure and training method are as follows.
1) The structure of the hotspot prediction model is shown in Figure 4; it consists of $t$ cyclic layers and three back-propagation (BP) neural networks. Let $X_t$ denote the output of the deep Boltzmann machine in cycle $t$.
In Figure 4, $U$ is the input weight, $W$ is the cyclic weight and $V_l$ is the weight of BP neural network $l$. The vector $\mathrm{net}_t^0$ denotes the weighted input of the cyclic neurons at time $t$, which is computed as [...]
ii) Similarly, the gradient of the weight matrix $U$ is computed as [...]
2) Cluster the scientific research words in the natural language library, and generate the relevance between scientific research keywords during the clustering process.
3) The scientific research hotspots are generated according to the predicted hotspot keywords and word relevance, and then pushed to users.
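The last two steps, combining predicted hotspot keywords with word relevance and pushing the result, can be sketched as follows. This is only an illustration under stated assumptions: the function name `generate_hot_issues`, the dictionary layout of the relevance data, and the `top_k` and `min_relevance` parameters are hypothetical, and the relevance values are taken as given rather than computed by clustering.

```python
def generate_hot_issues(keyword_scores, relevance, top_k=2, min_relevance=0.5):
    """Build hot issues from predicted keywords and word relevance.

    keyword_scores: {keyword: predicted hotspot probability}
    relevance:      {keyword: {related_word: relevance in [0, 1]}}
    Returns the top_k keywords, each with its sufficiently relevant words.
    """
    # Rank keywords by predicted probability, highest first
    ranked = sorted(keyword_scores.items(), key=lambda kv: kv[1], reverse=True)
    issues = []
    for keyword, score in ranked[:top_k]:
        neighbours = relevance.get(keyword, {})
        # Keep only strongly related words, most relevant first
        related = sorted(
            (word for word, r in neighbours.items() if r >= min_relevance),
            key=lambda word: neighbours[word],
            reverse=True,
        )
        issues.append({"keyword": keyword, "score": score, "related": related})
    return issues


if __name__ == "__main__":
    scores = {"rnn": 0.9, "svm": 0.2, "graph": 0.7}
    relevance = {
        "rnn": {"lstm": 0.8, "speech": 0.6, "svm": 0.1},
        "graph": {"gnn": 0.9},
    }
    for issue in generate_hot_issues(scores, relevance):
        print(issue)
```

Each returned entry pairs one hot keyword with the related technologies or applications that would be recommended together with it.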

Experiments and Analysis
In this section we use two assessment indicators, prediction accuracy rate and false positive rate, to evaluate system performance. We deploy our system on two servers: one crawls data, and the other hosts the system and runs the prediction and recommendation algorithm. We use the labelled data from Jan. 2016 to Dec. 2017 to train our model and the data from Jan. 2018 to Jun. 2018 to predict hot issues. From six months of experimental operation, we select 150 hot sci-tech issues and 150 non-hot sci-tech issues.
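The two indicators can be computed as in the sketch below. The paper does not spell out its formulas, so the standard definitions are assumed here: accuracy is the fraction of issues classified correctly, and the false positive rate is the fraction of non-hot issues that are predicted as hot. The function name `accuracy_and_fpr` is hypothetical.

```python
def accuracy_and_fpr(labels, predictions):
    """Return (accuracy, false positive rate) for binary hot/non-hot labels.

    labels:      1 for a truly hot issue, 0 for a non-hot issue
    predictions: 1 if the system predicts hot, 0 otherwise
    """
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    accuracy = (tp + tn) / len(pairs)
    # Share of non-hot issues that were wrongly flagged as hot
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, fpr


if __name__ == "__main__":
    labels = [1, 1, 0, 0]
    predictions = [1, 0, 1, 0]
    print(accuracy_and_fpr(labels, predictions))  # (0.5, 0.5)
```

With a balanced evaluation set such as 150 hot and 150 non-hot issues, the denominator of the false positive rate is the 150 non-hot issues.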
According to the prediction steps, the data training period is an important parameter. We set the crawling and training period to 3 days, 7 days, 14 days and 30 days. Figure 5 shows the accuracy rate and false positive rate for the different data periods. As the figure shows, the longer the period, the higher the accuracy rate and the lower the false positive rate. The false positive rate is somewhat high because our loss function does not take the false positive ratio into account; we will address this in future work. After obtaining the trained model, we perform the prediction. Considering both accuracy rate and training cost, we use 7 days as the period to evaluate the optimal time span of prediction. We let the time span of prediction range from one period to five periods. Figure 6 compares the accuracy rate and false positive ratio for the different prediction periods. From the first to the fifth prediction period, the accuracy rate decreases and the false positive ratio rises. Therefore, for the 7-day crawling period, predicting hotspots within two periods gives an acceptable accuracy rate and false positive ratio.
Last, we examine whether our system can find hotspots effectively and early. We set two weeks as a prediction cycle: the first two weeks are prediction cycle one, the second two weeks are prediction cycle two, and we choose 15 prediction cycles in total. We compare the prediction probability of our algorithm with the common method of word frequency statistics. The prediction probability is the value corresponding to the i-th keyword in the output vector. Figure 7 compares prediction probability and word frequency statistics over the prediction cycles for a certain keyword. The figure shows that in prediction cycle 9 our method predicts the keyword's popularity with high probability. It is considerably better than simple word frequency statistics in both accuracy and forward-looking forecasting.

Conclusion
In this paper we propose a method for predicting the sci-tech research hotspots of some time in the future, which provides research directions and research ideas for academic researchers. The relevance among scientific research words is used to generate a series of hot scientific research issues. The method delivers relevant and potentially hot technologies or applications together with the predicted hotspots. Our experimental evaluations also show that our RNN-based prediction approach predicts more accurately and in a more timely manner than the common method.