Algorithmic Models for MicroBlog Information Mining System

Based on the “MicroBlog Community Massive Information Mining and Monitoring System”, this paper introduces the algorithms and models behind the core functions of the MicroBlog information subsystem: word segmentation of MicroBlog text and the mining and conversion of MicroBlog information. These models produce the information objects needed at each stage of the mining process and provide the basis for monitoring and early warning on the MicroBlog monitoring platform.


Mechanical Word Segmentation Algorithm
According to certain strategies, the Chinese character string to be analyzed is matched with the entries in a "sufficiently large" machine dictionary (Chinese related dictionary). If a string is found in the dictionary, a word is recognized.

Understanding-based Word Segmentation Algorithm
The understanding-based word segmentation algorithm recognizes words by letting the computer simulate a person's understanding of sentences. Its basic idea is to carry out syntactic and semantic analysis at the same time as word segmentation, and to use the syntactic and semantic information to resolve ambiguity.

Statistics-based Word Segmentation Algorithm
The statistics-based word segmentation algorithm performs word segmentation by counting the frequency with which adjacent words co-occur.
An analysis of the characteristics of MicroBlog text and of the three kinds of word segmentation algorithms shows that mechanical word segmentation is the most suitable for this system. Because MicroBlog text is short and its forms of expression vary widely, word frequencies in MicroBlog text are not discriminative, so the statistics-based algorithm is not applicable. The understanding-based algorithm is still at the experimental stage, and its accuracy is not yet high enough.
The MicroBlog information mining subsystem therefore chooses the mechanical word segmentation algorithm, which is more mature and better suited to the characteristics of MicroBlog text, as its basic algorithm. However, since word segmentation is an intelligent decision-making process, mechanical word segmentation cannot completely solve the two basic problems of the segmentation stage: ambiguous segmentation and unknown word recognition. Its accuracy alone therefore cannot fully meet the needs of the subsystem. We use mechanical word segmentation as a preliminary segmentation of MicroBlog information, and then apply further processing to improve the accuracy of segmentation.

Automatic Word Segmentation Model
The automatic word segmentation model designed for the MicroBlog information mining subsystem takes the mechanical word segmentation algorithm as its basic algorithm and improves on it according to the characteristics of MicroBlog text. The model is shown in Fig.2.
The model is divided into three parts. In the first part, the MicroBlog text to be analyzed is matched against the entries of the Chinese and English dictionaries in the word segmentation lexicon. If a string is found in a dictionary, the match succeeds. This mechanical segmentation pass produces the initial segmentation strings.
In the second part, the initial result is scanned using priority recognition technology. Words with obvious features (entries in the word segmentation lexicon other than those in the Chinese and English dictionaries) are recognized and segmented a second time within the initial strings. With these words as breakpoints, a second matching calculation is carried out on the strings to reduce the matching error rate and improve segmentation accuracy.
In the third part, the final word segmentation information is obtained by comprehensive analysis of the results of mechanical word segmentation and priority recognition.
In summary, the model first segments the text mechanically, then uses priority recognition technology to compensate for the inaccuracy of mechanical segmentation, and finally analyzes and integrates the two results into the final segmentation, which gives the model high accuracy and efficiency.
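The second part of the model can be illustrated with a minimal Python sketch. It assumes that the feature words are available as a plain set of strings and that the mechanical segmenter is passed in as a function; the names split_at_feature_words, second_pass and segment are hypothetical and do not come from the system itself.

```python
import re


def split_at_feature_words(text, feature_words):
    """Split text into (chunk, is_feature) pairs, using feature words as breakpoints."""
    words = [w for w in feature_words if w]
    if not words:
        return [(text, False)] if text else []
    # Longest entries first, so overlapping entries prefer the longer one.
    pattern = "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))
    pieces = []
    last = 0
    for m in re.finditer(pattern, text):
        if m.start() > last:
            pieces.append((text[last:m.start()], False))
        pieces.append((m.group(), True))
        last = m.end()
    if last < len(text):
        pieces.append((text[last:], False))
    return pieces


def second_pass(text, feature_words, segment):
    """Keep feature words whole and re-segment the spans between them."""
    tokens = []
    for chunk, is_feature in split_at_feature_words(text, feature_words):
        tokens.extend([chunk] if is_feature else segment(chunk))
    return tokens


# `list` is a trivial stand-in segmenter that splits a span into characters.
print(second_pass("今天给力啊", {"给力"}, list))  # ['今', '天', '给力', '啊']
```

In practice the stand-in segmenter would be replaced by the mechanical segmentation pass, so that only the spans between breakpoints are matched against the dictionaries again.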

Flow Chart of Automatic Word Segmentation Algorithm
The operation flow of the automatic word segmentation algorithm is shown in Fig.3.

Preliminarily Classify the Meanings of MicroBlog Words
The forward maximum matching method (also rendered as positive maximum matching), the most commonly used method in mechanical word segmentation, is used to preliminarily classify the meanings of MicroBlog words. The steps are as follows:
(1) Construct a word segmentation lexicon of sufficient scope.
(2) Extract the text information S from the MicroBlog text to be processed.
(3) Test whether S.substring(0,k) matches a word in the lexicon. If so, save the length max=k. Then increase k by 1 and continue matching in the lexicon, repeating this step until k reaches the maximum length of words in the lexicon.
(4) Take the longest matched word, whose length is max.
(5) Return to (3) and continue processing the remaining characters, S=S.substring(max), until the initial semantic segmentation result of the MicroBlog information is obtained.
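Steps (3) to (5) can be sketched in Python as follows. This is a minimal illustration under two assumptions that the steps above do not state: the lexicon is a plain set of strings, and a single character is emitted when no lexicon entry matches. The function name forward_max_match and the sample lexicon are hypothetical.

```python
def forward_max_match(text, lexicon, max_len=None):
    """Segment text greedily, taking the longest lexicon entry at each position."""
    if max_len is None:
        # Maximum length of words in the lexicon (the upper bound on k).
        max_len = max((len(w) for w in lexicon), default=1)
    tokens = []
    pos = 0
    while pos < len(text):
        best = 1  # assumption: fall back to one character when nothing matches
        # Step (3): try every prefix length k and remember the longest match.
        for k in range(1, min(max_len, len(text) - pos) + 1):
            if text[pos:pos + k] in lexicon:
                best = k
        # Step (4): emit the longest word; step (5): continue on the remainder.
        tokens.append(text[pos:pos + best])
        pos += best
    return tokens


lexicon = {"微博", "信息", "挖掘", "系统", "信息挖掘"}
print(forward_max_match("微博信息挖掘系统", lexicon))  # ['微博', '信息挖掘', '系统']
```

Note that the longer entry "信息挖掘" wins over "信息", which is the behaviour that distinguishes maximum matching from stopping at the first match.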

The Second Matching Calculation
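On the basis of mechanical word segmentation, a feature scanning comparison of the initial segmentation results is carried out using priority recognition technology. Second recognition and segmentation are then performed on the words with obvious features, and these words serve as breakpoints for the second matching calculation of the string.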

Comprehensive Analysis
The final word segmentation information is obtained by comprehensive analysis of the above two results. The error rate of matching should be minimized and the accuracy of MicroBlog text string segmentation should be improved.

Analysis of Automatic Feature Word Algorithm
To extract, classify and analyze MicroBlog information correctly, the MicroBlog information mining system applies the automatic feature word algorithm to the segmented MicroBlog text and extracts feature words, which serve as the semantic representation of the MicroBlog text information.

Automatic Feature Word Algorithm Model
MicroBlog text information generally contains only a few dozen words, or a few sentences. Therefore, extracting feature words from MicroBlog text is in effect equivalent to extracting feature words from several sentences.
The automatic feature word algorithm model is built on a weighting operation applied to each word: the weights are compared, and the top-ranked words are extracted as feature words. In the design of the model, the MicroBlog information mining subsystem proceeds in two steps:
(1) According to the characteristics of Chinese sentences, part of speech, word frequency and position are taken into account to extract the information feature words.
(2) The words are compared with the existing feature lexicon to extract the specified feature words.
The automatic feature word algorithm model is shown in Fig.4.

Modeling Factors of Automatic Feature Word Algorithm
The factors that affect the weight of a word are the main factors in constructing the automatic feature word algorithm model. They are as follows:
(1) Part of speech. In MicroBlog text, the words that can identify the characteristics of the text are mostly notional words, such as nouns, verbs and adjectives. Function words, such as interjections, prepositions and conjunctions, do not contribute to identifying the characteristics of the text. Therefore, when extracting the feature words of MicroBlog text, these function words should be removed first. Among the notional words, nouns and verbs express the characteristics of MicroBlog text most strongly, so their weights should be the highest, followed by adjectives.
(2) Word frequency. Within the few dozen words of a MicroBlog text, words with high frequency are undoubtedly important and have a strong ability to distinguish the text. When designing the algorithm, word frequency should be positively correlated with the weight of a word.
(3) Syntactic structure. MicroBlog text generally consists of several sentences or one compound sentence. Declarative sentences have a strong distinguishing ability in MicroBlog text, while interrogative and exclamatory sentences are not representative in content. Therefore, when designing the algorithm, words contained in declarative sentences should receive a higher weight.
(4) Word length. Long Chinese words often reflect more specific, subordinate concepts, while short words often represent relatively abstract, superordinate concepts. Short words have higher frequency and more meanings and are function-oriented, while long words have lower frequency and are content-oriented. Increasing the weight of long words helps to reflect their true importance in MicroBlog text.
(5) Initial appearance position. Simple statistics show that keywords generally appear early in a text, so the weight of candidate words that appear near the front should be increased.
(6) Existing feature lexicon. If a word has already appeared as a feature word, it has attracted the attention of the system and is relatively representative, so its weight should be increased accordingly when designing the algorithm.

Formula of Automatic Feature Word Algorithm
The calculation formula of the automatic feature word algorithm is designed as follows. Let the relevant factors of a word i be:
Part of speech Xi: if word i is a noun or a verb, Xi=2; if it is an adjective, Xi=1; otherwise, Xi=0.
Word frequency Pi: Pi is equal to the number of occurrences of word i.
Word length Ci: Ci is equal to the number of characters contained in word i.
Result of syntactic analysis Fi: if the sentence in which word i is located is a declarative sentence, Fi=1; if it is any other sentence pattern, Fi=1/2.
Result of feature lexicon comparison Ti: if word i matches a word in the existing feature lexicon, Ti=2; otherwise, Ti=1.
Let Qi be the weight of word i. The calculation formula of the automatic feature word algorithm is then shown in formula (1).
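The factor definitions can be illustrated with a short Python sketch that computes the five factor values for every word of a segmented text. It stops at the factors and does not combine them into a weight. The Token structure, the tags "n", "v" and "a", and the choice to read the part of speech and sentence type from a word's first occurrence are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Token:
    word: str          # the segmented word
    pos: str           # hypothetical tag: "n" noun, "v" verb, "a" adjective
    declarative: bool  # True if the containing sentence is declarative


def word_factors(tokens, feature_lexicon):
    """Return {word: {"X": ..., "P": ..., "C": ..., "F": ..., "T": ...}}."""
    freq = Counter(t.word for t in tokens)
    factors = {}
    for t in tokens:
        if t.word in factors:
            continue  # assumption: X and F come from the first occurrence
        factors[t.word] = {
            "X": 2 if t.pos in ("n", "v") else 1 if t.pos == "a" else 0,
            "P": freq[t.word],                         # number of occurrences
            "C": len(t.word),                          # number of characters
            "F": 1.0 if t.declarative else 0.5,        # sentence pattern
            "T": 2 if t.word in feature_lexicon else 1,  # lexicon comparison
        }
    return factors


tokens = [
    Token("地震", "n", True),
    Token("发生", "v", True),
    Token("严重", "a", False),
    Token("地震", "n", False),
]
result = word_factors(tokens, feature_lexicon={"地震"})
print(result["地震"])
```

Keeping the factors separate from the weighting step makes it possible to plug in whatever combination rule the weight Qi uses without changing how the factors are gathered.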

Extraction of Automatic Feature Words
After Qi has been calculated for all the words in the MicroBlog text, the ten words with the highest Qi values are selected as the feature words of the MicroBlog text.
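The selection step can be sketched in Python as below, assuming the weights have already been computed and stored in a dictionary that maps each word to its Qi value. The function name top_feature_words and the sample weights are hypothetical.

```python
import heapq


def top_feature_words(weights, n=10):
    """Return the n words with the largest weights, highest first."""
    return heapq.nlargest(n, weights, key=weights.get)


# Illustrative weights only; real values would come from the weighting step.
weights = {"地震": 16.0, "救援": 8.0, "今天": 1.0, "发生": 4.0}
print(top_feature_words(weights, n=2))  # ['地震', '救援']
```

When a text contains fewer than n distinct words, the function simply returns all of them in descending order of weight.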

Model Overview
The similar clustering model implements the similar clustering function of the MicroBlog information mining subsystem, that is, it clusters similar MicroBlog information under the same event. Because MicroBlog texts are short and expressed in varied ways, the surface similarity between them is very low, so traditional clustering algorithms based on text similarity comparison cannot meet the accuracy requirements of the subsystem. The subsystem therefore clusters by comparing the semantic similarity of MicroBlog texts. As described above, feature words represent the semantics of MicroBlog text information, so the subsystem completes the similar clustering operation by comparing the similarity of feature words. [3]

Model Design
There are two attributes worth considering in the similar clustering algorithm of feature words, one is the meaning of the word, and the other is the weight of the feature words, that is Qi in formula (1). Among them, the meaning of a word is the attribute of the feature word itself, while the weight is the extension attribute.
If the feature words of some paragraphs of MicroBlog text have similar meanings and weights, then these paragraphs are most likely to express the same event.
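The idea of grouping texts whose feature words agree in both meaning and weight can be sketched in Python as follows. This is only an illustration: the weighted overlap used here compares words by exact match and is a simple stand-in, not the semantic similarity algorithm of the system. The greedy single-pass grouping, the threshold of 0.5 and the function names are likewise assumptions.

```python
def weighted_overlap(a, b):
    """Stand-in similarity: shared weight divided by total weight (0.0 to 1.0)."""
    words = set(a) | set(b)
    shared = sum(min(a.get(w, 0.0), b.get(w, 0.0)) for w in words)
    total = sum(max(a.get(w, 0.0), b.get(w, 0.0)) for w in words)
    return shared / total if total else 0.0


def cluster(posts, threshold=0.5, similarity=weighted_overlap):
    """Greedy one-pass clustering; each cluster is a list of indices into posts."""
    clusters = []
    for i, post in enumerate(posts):
        for members in clusters:
            # The first member of a cluster acts as its representative.
            if similarity(posts[members[0]], post) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])  # no existing cluster is close enough
    return clusters


# Each post is represented by its feature words and their weights.
posts = [
    {"地震": 4.0, "救援": 2.0},
    {"地震": 4.0, "救援": 1.0},
    {"股市": 3.0},
]
print(cluster(posts))  # [[0, 1], [2]]
```

Because the similarity function is a parameter, a measure that also credits near-synonyms could be substituted without changing the grouping logic.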

Semantic Similarity Algorithms
In the design of semantic similarity, the similarity algorithm of feature words under the framework of KDML (Knowledge Database Mark-up Language) is used to calculate the similarity of feature words of several paragraphs.
KDML is a new system of knowledge description norms. It is a common knowledge base which takes the concepts represented by Chinese and English words as descriptive objects to reveal the basic contents between concepts and between the attributes of concepts, containing rich lexical semantic knowledge and world knowledge. In KDML, the description of lexical semantics is defined as semantic item (concepts), and each word can be expressed as several semantic items. Semantic items are sometimes described by a knowledge representation language. The words used in this knowledge representation language are called sememes. The semantic tree of KDML organizes the sememes describing lexical semantics into tree-like structures. According to the attribute relationship between the sememes, the sememes are divided into several sememe trees. There are some relationships between trees, forming the network knowledge structure of KDML.

Design of Information Quantization Algorithms
Information quantization is the transformation of MicroBlog information from characters carrying linguistic meaning into values carrying quantitative information, which provides a scientific basis for the monitoring and early warning of the system. The quantized value is the monitoring value, which indicates the degree of attention that a MicroBlog event receives. This degree of attention depends on the number of fans of the original author and of the forwarding microbloggers, and on the number of forwards and comments. The number of forwards and comments reflects actual attention and is the hard attention, while the number of fans represents how many people may see the MicroBlog and is the soft attention. [4][5]
By synthesizing the difference between hard and soft attention, the significance of the forward and comment counts, and the correlation between MicroBlog events and public security work, the quantization formula of information is derived. Let a MicroBlog event be k, a MicroBlog text contained in this event be j, and a user who forwards or comments on j be i. Let Zij be the number of forwards and comments of i for j, Fij the number of fans of i, Ck the monitoring value of k, and Xk the correlation degree of k. If Zij=5 is taken as a degree of concern, then

Flow Chart of Information Quantization Algorithm
The flow chart of the information quantization algorithm is shown in Fig.6.

Conclusions
This paper mainly introduces the relevant algorithms and models in the process of information mining and processing of MicroBlog information mining system.
The system constructs an automatic word segmentation model that segments MicroBlog text strings semantically over several passes, which lays the foundation for the system's information processing. It then applies the automatic feature word algorithm to the segmented text to extract feature words, enabling correct extraction and classification of MicroBlog information. On this basis, it constructs a similar clustering model that compares the semantic similarity of MicroBlog texts to implement the similar clustering function of the MicroBlog information mining subsystem.
Finally, the information quantification model is deduced, and the MicroBlog information is quantified to provide a scientific basis for the monitoring and early warning of the system.