Research on Text Representation Method Based on Improved TF-IDF

Because different features contribute differently to a text, this paper improves the traditional TF-IDF algorithm by incorporating the position information of words and their ability to distinguish between texts, and proposes a text representation method based on the improved TF-IDF together with the corresponding word vector calculation method. Finally, the effectiveness of the method is demonstrated by experiments.


Introduction
In recent years, modern information technologies such as big data and cloud computing have developed rapidly. Massive amounts of data are uploaded to the Internet, and document storage and management have shifted from traditional reliance on manual work to reliance on computers. How to effectively manage information on the Internet and obtain useful information from it is a current research hotspot, and text classification is the basis of text management. The basic process of text classification is shown in Fig. 1.
Figure 1. Text classification.
This paper mainly improves the TF-IDF method, one of the text representation methods.

Research Status
TF-IDF is a widely used method for calculating text feature weights. To improve the accuracy of text classification, many scholars at home and abroad have improved it. Literature [1] applies a binary normal segmentation to TF-IDF based on the significance relation of the class distribution. Literature [2] replaces IDF with relevance frequency and proposes the TF-RF weight calculation method. Literature [3] introduces the inter-class and intra-class differences of text categories into the TF-IDF calculation. Literature [4] improves TF-IDF using the theory of the chi-square distribution. Literatures [5][6][7] also improve the TF-IDF algorithm and prove its effectiveness through experiments. However, current improvements to TF-IDF are still not comprehensive enough and easily ignore the influence of position and other information on feature weights. Because different feature items in a text differ in their importance to the text category, this paper improves the traditional TF-IDF algorithm using the position information of words and their ability to distinguish texts, and gives the calculation method of the text representation based on the improved TF-IDF.

The traditional TF-IDF algorithm
TF-IDF is a common feature processing method, which combines Term Frequency (TF) and Inverse Document Frequency (IDF). The calculation method is shown in Formulas (1) and (2):

TF-IDF(t, d) = TF(t, d) × IDF(t)    (1)
IDF(t) = log( N / (N(t) + 1) )    (2)

Here TF(t, d) represents the frequency of occurrence of the word t in the text d, N represents the total number of texts, N(t) represents the number of texts containing the word t, and the plus 1 prevents the case N(t) = 0. However, this simple structure cannot describe the importance of feature items well. TF-IDF only considers the frequency of a feature over the whole corpus but ignores its distribution across the texts. If a feature occurs frequently in one category but rarely in others, it should be given a higher weight; yet by the definition of IDF it may be given a lower weight, so the weight cannot be adjusted well. Moreover, TF-IDF treats the whole text uniformly when counting features, ignoring their position information, and differences in the number of texts per category in the data set may also distort the weight calculation.
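As a concrete illustration of these definitions, the following is a minimal pure-Python sketch of Formulas (1) and (2); the toy corpus and tokenization are illustrative, not from the paper:

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Term frequency: occurrences of `term` in the document / document length."""
    counts = Counter(doc_tokens)
    return counts[term] / len(doc_tokens)

def idf(term, corpus):
    """Inverse document frequency with the +1 smoothing from Formula (2)."""
    n_t = sum(1 for doc in corpus if term in doc)  # number of texts containing the term
    return math.log(len(corpus) / (n_t + 1))

def tf_idf(term, doc_tokens, corpus):
    """TF-IDF weight of `term` in one document, per Formula (1)."""
    return tf(term, doc_tokens) * idf(term, corpus)

# Toy corpus of pre-tokenized documents (illustrative only)
corpus = [["cat", "sat", "mat"], ["dog", "sat", "log"], ["cat", "cat", "purred"]]
```

Note how the smoothing interacts with common words: "cat" appears in 2 of 3 texts, so its IDF is log(3/3) = 0, while the rarer "purred" keeps a positive weight.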
To describe the features of a text more fully, the main factors that affect the importance of a feature in actual processing are:
• the position of the word in the text;
• the frequency of the word in a single text;
• the number of texts of the same category in which the word appears;
• the differing counts and frequencies of the word across texts of different categories.
Considering that different features contribute differently in different categories, this paper proposes an improved TF-IDF text representation method that combines position information with the influence of features on the text. Before processing the text, the feature words must first be standardized, namely text normalization [8][9][10][11], to avoid biased feature statistics caused by different forms of synonyms.

Location-based Feature Processing
In natural language processing, content words such as nouns and verbs affect the semantics of a text, while function words such as prepositions and conjunctions generally carry little meaning. Including such words in the text classification process introduces considerable noise and reduces both the efficiency and the accuracy of classification. Therefore, such noise-prone words should be removed before extracting text features.
Feature weighting quantifies a feature according to its importance and is a key step in converting text into vectors for computation. When determining the weight of a feature, information such as the position of the word must be considered comprehensively, and corresponding weight correction rules formulated. For example, for words that appear in important positions (titles, abstracts, etc.) but are not frequent in the full text, the counted frequency should be reasonably increased so that the final weight assignment is more reasonable. The main factors to consider are:
• Title. The title is a phrase given by the author to indicate the content of the article. It is generally required to be concise and eye-catching, often containing abbreviations closely related to the main content of the text. The title itself is a high-level summary of the central content, so correctly identifying titles helps to accurately summarize the theme of the text and improves the quality of text classification.

• Abstract and the first and last sentences. According to a survey by E. E. Baxendale in the United States, the probability that the central sentence of a paragraph is its first sentence is about 85%, and the probability that it is the last sentence is about 7%. Therefore, the weights of features in these special positions should be treated specially. In addition, sentences in the first paragraph, the last paragraph, the beginning and end of each paragraph, and in titles and subtitles are often closely related to the main content of the article, so the weights of features in these positions should be adjusted appropriately.
• Syntactic structure. In the structure of a text, there is a connection between the structure of a sentence and its importance. For example, sentences introduced by summarizing phrases such as "in short" and "in summary" usually contain the central content of the text.
Combining the above, the specific method for determining feature weights is as follows: first, the text is segmented and normalized, and the segmentation results are counted. Second, the counted frequencies are corrected using the feature position weighting rules, and the weight of each feature is then calculated with the weight determination algorithm.
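The frequency-correction step above can be sketched as follows. The paper's actual position multipliers are given later in Formula (8), which is not reproduced here, so the `LOCATION_WEIGHT` values below are assumptions for illustration only:

```python
from collections import Counter

# Illustrative position multipliers -- assumptions, not the paper's Formula (8)
LOCATION_WEIGHT = {"title": 3.0, "first_sentence": 2.0, "last_sentence": 1.5, "body": 1.0}

def location_corrected_counts(sections):
    """sections: mapping of position name -> list of normalized tokens.
    Returns position-weighted pseudo-frequencies for each token."""
    weighted = Counter()
    for location, tokens in sections.items():
        w = LOCATION_WEIGHT.get(location, 1.0)
        for tok in tokens:
            weighted[tok] += w
    return weighted

# Toy segmented-and-normalized document (illustrative)
doc = {
    "title": ["tfidf", "improvement"],
    "first_sentence": ["tfidf", "weight"],
    "body": ["weight", "weight", "experiment"],
}
counts = location_corrected_counts(doc)
```

With these multipliers, "tfidf" (twice, in important positions) ends up outweighing "weight" (three times, mostly in the body), which is exactly the correction the rule aims for.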

The improved TF-IDF algorithm
This paper improves the TF-IDF algorithm based on the factors that affect feature importance in actual processing. In a text classification task, if a word appears only in texts of one category and is evenly distributed among them, it directly reflects the characteristics of that category and should be given a higher weight. Likewise, if a word appears much more frequently in one category than in others and is relatively evenly distributed, it is closely related to that category.
Assuming that the set of text categories is {C_1, C_2, ..., C_i, ...}, the original feature weight calculation is improved on the basis of TF-IDF by adding the influence of the intra-class and inter-class distribution of features, yielding the weight calculation shown in Formula (3). However, Formula (3) has a problem: when a feature appears only in one category but its frequency is very low relative to the whole corpus, as with terminology and unknown words, the value P_{i_high} will be low, which understates the true importance of the feature. Therefore, the weights of such features need to be adjusted; the main targets are terminology and category-specific unknown words, i.e., words strongly associated with one category. First, find the average probability P_vital of words that appear only in texts of one category. For a given word, judge whether it appears only in one category, that is, whether n_ij / Σ_j n_ij equals 1, where n_ij denotes the number of texts of category j containing word i. If it equals 1, its weight is adjusted by the calculation method in Formula (4).
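The category-exclusivity test described above (whether n_ij / Σ_j n_ij equals 1) can be sketched in pure Python; the counting scheme is as described in the text, while the toy labeled corpus is illustrative. The actual weight adjustment of Formula (4) is not reproduced in this chunk, so only the detection step is shown:

```python
from collections import defaultdict

def category_counts(labeled_docs):
    """labeled_docs: list of (category, tokens) pairs. Returns n[word][category],
    the number of texts of each category that contain the word."""
    n = defaultdict(lambda: defaultdict(int))
    for category, tokens in labeled_docs:
        for word in set(tokens):  # count each text at most once per word
            n[word][category] += 1
    return n

def exclusive_to_one_category(word, n):
    """True when n_ij / sum_j n_ij == 1 for some category, i.e. the word occurs
    only in texts of a single category and is a candidate for weight adjustment."""
    counts = n[word]
    total = sum(counts.values())
    return total > 0 and max(counts.values()) == total

# Toy labeled corpus (illustrative)
docs = [
    ("finance", ["stock", "market", "bond"]),
    ("finance", ["stock", "rate"]),
    ("sports", ["match", "stock"]),
]
n = category_counts(docs)
```

Here "bond" occurs only in finance texts and would have its weight adjusted, while "stock" spans both categories and would not.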

The word2vec text representation method
Word2vec [12] is a common word embedding model, a word vector training tool released by Google in 2013, characterized by efficiency and simplicity. Word2vec models words and their contexts in text with a neural network and finally produces low-dimensional vectors, typically of 100-300 dimensions.

Word Vector Feature Processing Based on Improved TF-IDF
In text processing, different features influence the text differently. For example, features related to the theme and structure of the text are more important than general features, so the feature items of the text need to be weighted. With the Word2vec word vector method, the usual way to obtain a vector representation of a text is the weighted average or the TF-IDF weighting method [13][14], as shown in Formulas (5) and (6), where TF-IDF(w_k) represents the TF-IDF value of the word w_k. However, the weighted average ignores that different features matter to the text to different degrees, while the TF-IDF method treats every word position equally and does not consider the influence of position or of intra-class and inter-class distribution on feature importance; it only reflects the ability of words to distinguish texts and cannot well reflect the importance and distribution of feature words. Therefore, this paper presents a weighted word vector processing method based on the improved TF-IDF, which considers word position information and the ability of words to distinguish categories. The calculation method is shown in Formula (7), with the weights given by Formulas (3) and (4). The text classification flow based on the improved TF-IDF text representation method is shown in Fig. 2.
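A weight-normalized document vector of this kind can be sketched as follows. The vectors and weights below are toy values; in the paper's method the weights would come from the improved TF-IDF of Formulas (3) and (4), which are not reproduced here:

```python
def weighted_doc_vector(tokens, vectors, weights):
    """Document vector as the weight-normalized sum of its word vectors:
    v(d) = sum_k weight(w_k) * v(w_k) / sum_k weight(w_k)."""
    dim = len(next(iter(vectors.values())))
    acc = [0.0] * dim
    total = 0.0
    for tok in tokens:
        if tok not in vectors:
            continue  # out-of-vocabulary words are skipped
        w = weights.get(tok, 1.0)
        acc = [a + w * v for a, v in zip(acc, vectors[tok])]
        total += w
    return [a / total for a in acc] if total else acc

# Toy 2-dimensional word vectors and weights (illustrative)
vectors = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
weights = {"cat": 3.0, "dog": 1.0}
vec = weighted_doc_vector(["cat", "dog"], vectors, weights)
```

Setting all weights to 1.0 recovers the plain weighted-average method (method 1 in the experiments); supplying TF-IDF values recovers method 2.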

Experiments
To verify the effectiveness of the feature processing method based on word position and word class distribution, the following experiments are carried out and the results analyzed. Experimental environment: Anaconda 3.7, Keras, Jieba, Gensim. Experimental data: the Sogou Corpus (Sogou CS) and the Fudan University Chinese Text Classification Corpus. Evaluation metrics: the main metrics for NLP tasks are precision, recall and the F1 value, where the F1 value combines precision and recall.
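For reference, these three metrics can be computed per class as follows; the label names are illustrative:

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Per-class precision, recall and F1 from true vs. predicted labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels (illustrative)
y_true = ["sports", "finance", "sports", "sports"]
y_pred = ["sports", "sports", "sports", "finance"]
p, r, f = precision_recall_f1(y_true, y_pred, "sports")
```

Macro-averaging these per-class values over the four categories gives the averages reported in the experiments.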
The Sogou Corpus (Sogou CS) collects news data from 18 channels of Sohu News from June to July 2012, including URL, title, body content, etc. The data are packaged in .dat format with a total size of 1.43 GB and can be downloaded from Sogou Laboratory. For each of the four categories finance, science & technology, sports and entertainment, 1,000 articles were selected. The Fudan University Chinese Text Classification Corpus is a Chinese text processing corpus developed by the Natural Language Processing Group of the International Database Center, Department of Computer Information and Technology, Fudan University. For each of the four categories economy, politics, history and art, 1,000 texts were selected as experimental data.
There are generally two ways to obtain feature vectors with Word2vec. One is to use a global, usually large-scale, pre-trained word vector library. For example, the Chinese Word Vector Corpus developed by the Institute of Chinese Information Processing of Beijing Normal University and the DBIIR Laboratory of Renmin University of China [15] includes word vectors trained on dozens of common corpora such as Zhihu, People's Daily, Baidu Encyclopedia and Ancient Chinese. The other is to train on the experimental corpus itself to obtain a local word vector library. For convenience of calculation, this paper adopts the latter method and obtains a local feature vector library by training on the whole corpus.
First, the words in the text are standardized to avoid inaccurate word frequency statistics caused by different forms of synonyms. To demonstrate the reasonableness and validity of the proposed representation, a comparative experiment was conducted among the weighted average word vector method (method 1), the TF-IDF weighted word vector method (method 2) [16] and the text representation method based on the improved TF-IDF (method 3), with 80% of each data set used for training and 20% for testing. After word segmentation, the word frequencies of the segmentation results are processed by the position weighting method. To make the position information of words more prominent, the frequencies of features in important positions are amplified. Based on the survey results of E. E. Baxendale and the proportions of each part of the processed text, the processing method is shown in Formula (8).
Here n_res represents the frequency of the processed word. Finally, word vectors are constructed according to the text representation method based on the improved TF-IDF. Considering that the news data are relatively short, the word vector dimension is set to 100 and the text classifier is an SVM. The experimental results are shown in Figure 3 and Table 1.
Analysis of the experimental results shows that on both data sets used in this paper, the average precision, recall and F1 value of text classification based on the improved TF-IDF text representation method are higher than those of the weighted average and TF-IDF weighted text representation methods. Therefore, the text representation method based on the improved TF-IDF improves the quality of text feature processing and benefits subsequent text processing tasks.

Conclusion
Aiming at the shortcomings of traditional text feature weight calculation methods, this paper proposes an improved TF-IDF algorithm based on the position information of words in the text and the relationship between features and categories, and presents a weighted word vector processing method based on the improved TF-IDF. Experiments show that the method is more effective than traditional text representation methods, laying a foundation for subsequent text classification tasks.