BSEN: A BiSiamese Entity Normalization Method for Biomedicine

Normalization of named entities is an important task in biomedical text mining. Compared with other tasks in biomedical text mining, relatively little research has addressed entity normalization. This article proposes BSEN, a BiSiamese entity normalization method for biomedicine. First, text similarity algorithms are analyzed, and an improved similarity measure that combines biomedical inverse text frequency with cosine similarity (BIC) is proposed. Second, the data set is trained in pairs using a BiSiamese network, and the result is combined with BIC to calculate text similarity. The entity in the normalization knowledge base with the maximum similarity is the normalized result produced by BSEN. Experiments on the verification data set show that BSEN achieves better normalization results than the baseline method it is compared with.


Introduction
Named entity recognition (NER) and named entity normalization (NEN) have long been important tasks in biomedical text mining. Entity normalization collects the different expressions of the same entity name, establishes the correspondence between the canonical name and its variant names, and identifies the entity by assigning it a unique identifier. Compared with biomedical entity recognition, little research has addressed entity normalization, and the task remains very challenging. Normalizing the entities in biomedical texts links the canonical name of an entity with the non-normalized names of its different expressions, which improves the efficiency and accuracy of query retrieval. Assigning a unique identifier to a normalized entity also makes it easier to interoperate with other systems and to share resources. The contributions of this paper are as follows: 1) an improved text similarity measurement algorithm, BIC, based on the combination of biomedical inverse text frequency and cosine similarity is proposed; 2) BSEN, an entity normalization method for biomedical text based on the BiSiamese network and the BIC text similarity measurement algorithm, is proposed.

Related Work
At present, research on biomedical entity normalization mainly includes rule-based methods, dictionary-based pattern matching methods and machine learning-based methods. Most normalization research relies on rules and medical dictionary matching. In a series of tasks in the BioCreative competition, concept normalization has played a major role for genes and proteins [1]. One of the sub-tasks in BioCreative V is the identification and normalization of disease names [2]. The OSCAR system [3] normalizes a variety of chemical corpora, aiming at text mining of chemical substances. The ProMiner system [4] uses a dictionary-based approach with approximate string pattern matching to identify and normalize gene and protein names. The system uses pre-processed dictionaries that include biological entities with known synonyms. The MetaMap system [5] maps biological entities to concept identifiers in the Unified Medical Language System (UMLS) [6] and aims to improve the retrieval of MEDLINE-related citations. The UWM system [7] uses the CRFs algorithm to extract disorder entities from clinical data, uses edit distance to automatically learn the variation patterns of clinical terms from the task training set and the Unified Medical Language System, and then normalizes the entities. Among machine learning-based approaches to biomedical entity normalization, DNorm [8], proposed by Leaman et al., adopts pairwise learning. Using a pipeline (serial execution) architecture, it achieved good results on the NCBI disease data set [9], with an accuracy of 82.8%, but such an architecture easily causes cascading errors between components. So far, TaggerOne [10] is one of the best performing machine learning-based systems on the NCBI data set. The model uses a semi-Markov model for biomedical entity recognition and normalization at prediction time. Both of the above methods combine normalization with NER.
CISAI 2020 Journal of Physics: Conference Series 1693 (2020) 012087 IOP Publishing doi:10.1088/1742-6596/1693/1/012087
In named entity normalization, rule-based and dictionary-based pattern matching methods depend heavily on the accuracy of the rules and dictionaries, so choosing an appropriate text similarity algorithm is particularly important. Calculating text similarity [11] is a complicated problem. At present, there are three main families of methods for calculating text similarity: matching-based methods, word vector-based methods, and neural network-based methods. The word matching method [12] is the most classic and intuitive: it measures the similarity of two texts by comparing the frequency of the words they share. The word vector-based method [13] maps each word to a fixed-length vector and treats each word as a point in a high-dimensional space, so the distance between vectors can be used to characterize the similarity between words. However, neither of these two methods takes contextual semantic information into account.

BSEN Method
This paper proposes a biomedical text entity normalization method called BSEN (A BiSiamese Entity Normalization method for Biomedicine). The execution process of the BSEN method is shown in Figure 1. To obtain the bidirectional semantic information and vector representation of the input text, the method trains the data set in pairs with a Siamese network [14] fused with BiLSTM, and then combines the result with the BIC algorithm to calculate text similarity. The entity in the knowledge base with the maximum similarity is the normalized result of the input entity. Finally, the normalized result is checked for correctness.

Text Similarity Algorithm
Due to the particularity of biomedical texts, the importance of core words in the text needs to be considered. Core words play a greater role in matching, and inverse text frequency measures the importance of a word. This paper presents an improved similarity measurement algorithm that combines biomedical inverse text frequency with cosine similarity, called BIC. The idf value of each word is used as a weight in computing the sentence vector; the word idf and the sentence vector are given by formulas (4) and (5), where D is the total number of words in the text, D_w is the number of occurrences of the word w, v(w_i) is the vector of the word w_i in the sentence, and idf(w_i) is the idf weight of w_i. After the vector representations of the two sentences are obtained in this way, the weighted word vectors are accumulated into a single sentence vector. The similarity E(X_1, X_2) of the two sentences is then obtained with cosine similarity, as in formula (6). Assuming that the two n-dimensional vectors obtained from the inverse text frequency are X_1 = (x_1, x_2, ..., x_n) and X_2 = (y_1, y_2, ..., y_n), E(X_1, X_2) is calculated as follows:

E(X_1, X_2) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \, \sqrt{\sum_{i=1}^{n} y_i^2}}    (6)

The Siamese network is essentially a similarity measurement method. It is suitable for cases where the number of categories is large but the number of samples in each category is small, and it can be used for category identification and classification. Traditional classification methods need to know which category each sample belongs to and are suitable for cases where the number of samples in each category is large but the number of categories is small. The Siamese network makes up for this shortcoming: it learns a similarity measure from the data and uses this measure to compare and match samples of new, unknown categories, and can thereby distinguish categories not seen in training.
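Returning to the BIC measure defined earlier in this section, the following is a minimal sketch of idf-weighted sentence vectors combined with cosine similarity. It assumes the idf takes the common form idf(w) = log(D / D_w) computed over a small corpus, which may differ from the exact formulas (4) and (5). The function names, the toy corpus and the random word vectors are all hypothetical stand-ins for trained embeddings.

```python
import math
from collections import Counter

import numpy as np


def idf_table(corpus):
    """Assumed idf form: idf(w) = log(D / D_w), where D is the total word
    count of the corpus and D_w is the number of occurrences of w."""
    counts = Counter(w for sentence in corpus for w in sentence)
    total = sum(counts.values())
    return {w: math.log(total / c) for w, c in counts.items()}


def sentence_vector(sentence, vectors, idf):
    """Sum the word vectors of a sentence, each weighted by its idf value."""
    return np.sum([idf[w] * vectors[w] for w in sentence], axis=0)


def bic_similarity(s1, s2, vectors, idf):
    """Cosine similarity between the two idf-weighted sentence vectors."""
    x1 = sentence_vector(s1, vectors, idf)
    x2 = sentence_vector(s2, vectors, idf)
    denom = np.linalg.norm(x1) * np.linalg.norm(x2)
    return float(x1 @ x2 / denom) if denom > 0 else 0.0


# Toy tokenized corpus (hypothetical); random vectors replace trained embeddings.
corpus = [
    ["acute", "myocardial", "infarction"],
    ["acute", "heart", "attack"],
    ["chronic", "heart", "failure"],
]
rng = np.random.default_rng(0)
vectors = {w: rng.random(4) for sentence in corpus for w in sentence}
idf = idf_table(corpus)

print(bic_similarity(corpus[0], corpus[1], vectors, idf))
```

Because frequent words such as "acute" and "heart" receive a smaller idf weight than rare words, the rarer core words dominate the sentence vector, which is the effect the BIC measure is designed to achieve.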
In this paper, BiLSTM is integrated into the Siamese network to capture the context information of the text; the resulting network is called the BiSiamese network. The network structure is shown in Figure 2.
The BiLSTM in the BiSiamese network consists of two LSTM structures with opposite directions. The outputs h_{i1} and h_{i2} of the forward and backward hidden layers are concatenated to obtain the final output representation, as in formula (7), so that the semantic information of the current word can be captured to the greatest extent.
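As a hedged reading of formula (7), assuming the standard BiLSTM convention in which h_{i1} and h_{i2} are the forward and backward hidden states at position i, the concatenation can be written as below. The symbol H for the sequence of concatenated outputs is an assumption, chosen to match the H referred to in the normalization steps.

```latex
h_i = [\, h_{i1} \,;\, h_{i2} \,], \qquad H = (h_1, h_2, \ldots, h_n)
```

Under this reading, if each direction has k hidden units, every h_i has dimension 2k.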
This article applies the BiSiamese network to the entity normalization of biomedical texts. The specific steps for normalization are as follows:
Step 1 Input BiomedicalTextSet in pairs.
Step 2 Use character embedding to perform vector characterization to obtain a sequence of context word vectors BE = (be_1, be_2, ..., be_i, ..., be_n).
Step 3 Input BE into the BiLSTM layers to obtain the output sequence H.
Step 4 Average the outputs of H over all time steps and feed the result into the 128-dimensional fully connected layer.
Step 5 Calculate the similarity with the BIC algorithm; the entity in NormalizationEntityList with the maximum similarity is the normalized result of the BSEN method.
Step 6 Determine whether the output result is the correct normalized entity.
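The lookup in Steps 5 and 6 can be sketched as follows. This is only an illustration of the "pick the entity with maximum similarity" logic: the encoder here is a stand-in that counts character bigrams, not the trained BiSiamese network, and plain cosine similarity replaces BIC. The function names and the toy entity list are hypothetical.

```python
import numpy as np


def char_bigram_vector(text, vocab):
    """Stand-in encoder: counts of character bigrams (NOT the BiSiamese encoder)."""
    vec = np.zeros(len(vocab))
    for i in range(len(text) - 1):
        idx = vocab.get(text[i:i + 2])
        if idx is not None:
            vec[idx] += 1.0
    return vec


def cosine(a, b):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def normalize(mention, entity_list):
    """Return the knowledge-base entity with the maximum similarity and its score."""
    texts = [mention] + list(entity_list)
    bigrams = sorted({t[i:i + 2] for t in texts for i in range(len(t) - 1)})
    vocab = {bg: k for k, bg in enumerate(bigrams)}
    query = char_bigram_vector(mention, vocab)
    scores = [cosine(query, char_bigram_vector(e, vocab)) for e in entity_list]
    best = int(np.argmax(scores))
    return entity_list[best], scores[best]


# Hypothetical knowledge base and input mention.
normalization_entity_list = ["myocardial infarction", "diabetes mellitus", "hypertension"]
entity, score = normalize("acute myocardial infarct", normalization_entity_list)

# Step 6: compare the prediction with the annotated gold entity.
gold = "myocardial infarction"
print(entity, round(score, 3), entity == gold)
```

In the BSEN method the vectors would come from the fully connected layer of Step 4, and the returned score would be the BIC similarity.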

Loss Function
The loss function used in this article is the contrastive loss, chosen mainly because the data set in the BSEN method consists of paired samples. The contrastive loss expresses the degree of matching of paired samples well and can also be used to train feature-extraction models. The loss functions for similar and dissimilar cases are given by equations (8), (9) and (10), where E is the similarity E(X_1, X_2) obtained by the BIC algorithm. For similar pairs, the loss decreases monotonically as E increases. For dissimilar pairs with E greater than the set threshold m, the loss increases monotonically with positive E.
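The behaviour described above can be illustrated with one common form of contrastive loss for cosine-like similarities. This is a sketch chosen to be consistent with the stated monotonicity, and it may differ from equations (8) to (10): the factor 1/4, the squared terms and the default margin of 0.5 are assumptions, as is the function name.

```python
import numpy as np


def contrastive_loss(similarity, label, margin=0.5):
    """Assumed contrastive loss on a similarity E in [-1, 1].

    label = 1 marks a similar pair, label = 0 a dissimilar pair.
    Similar pairs:    0.25 * (1 - E)^2, which falls as E rises.
    Dissimilar pairs: E^2 if E > margin, otherwise 0.
    """
    similarity = np.asarray(similarity, dtype=float)
    label = np.asarray(label, dtype=float)
    loss_similar = 0.25 * (1.0 - similarity) ** 2
    loss_dissimilar = np.where(similarity > margin, similarity ** 2, 0.0)
    return label * loss_similar + (1.0 - label) * loss_dissimilar


# A batch of three pairs: similar and close, dissimilar but close, dissimilar and far.
batch_similarity = [0.9, 0.8, 0.2]
batch_label = [1, 0, 0]
print(contrastive_loss(batch_similarity, batch_label))
```

With this form, a dissimilar pair whose similarity stays below the margin contributes nothing to the loss, so training effort concentrates on dissimilar pairs that the network still scores as close.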

Experimental Environment and Parameters
Table 1 gives the experimental environment and parameters.

Experimental Data Set
The data set used in the experiments consists mainly of a Chinese biomedical text corpus crawled from the web, which was manually annotated and cleaned. The data contains a total of 17,000 items, of which 10,000 are used for the similarity comparison experiment and 7,000 for the entity normalization experiment (2,000 in the training set, 2,000 in the test set and 3,000 in the verification set).

Experiment Procedure
First, the text similarity algorithms are analyzed. The data set used in this experiment consists of 10,000 biomedical sentence texts collected from the web. Three algorithms are compared: Cosine, Jaccard and BIC. Each algorithm is given the same five groups of data, each group containing 10 sentences, and the average value is taken as the result. The results are shown in Table 2. The main metrics are the calculation time of each algorithm and the highest and lowest similarity among the five results. The experimental results in Table 2 show that Jaccard has the lowest computational complexity and is the simplest method, but it cannot process the semantic information of a sentence: the characteristic attributes of individuals are represented by symbolic or Boolean values, so the size of a difference cannot be measured. Jaccard is therefore better suited to calculating the similarity between individuals described by symbolic or Boolean measures. The Cosine method is the most commonly used algorithm, but its result depends heavily on the quality of the word2vec training. Although the computational complexity of the BIC method is relatively high, its results are more accurate than those of the other two methods. Because of the special nature of biomedical corpus text, the importance of the core words in the text needs to be considered. Therefore, in the normalization process, this paper uses the BIC method to calculate similarity.
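The limitation of Jaccard noted above can be seen in a small sketch. Jaccard similarity only records whether a token is present in each text, so two expressions of the same concept that share no tokens score zero. The function name and the example phrases are hypothetical and are not taken from the experimental data.

```python
def jaccard(tokens_a, tokens_b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(tokens_a), set(tokens_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Overlapping surface forms receive a high score.
print(jaccard(["acute", "heart", "attack"], ["heart", "attack"]))

# Synonymous expressions without shared tokens receive 0.0.
print(jaccard(["heart", "attack"], ["myocardial", "infarction"]))
```

The computation involves only set operations, which is consistent with Jaccard having the lowest cost of the three methods, but it carries no notion of how close two different words are in meaning.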
In the entity normalization experiment, since the Siamese network contains two branches, paired training is used. The data set contains three columns: the first column is the standardized entity name, the second column is the entity alias, and the third column is a canonical identifier. Since entity names are short texts, the Siamese network uses character-based embedding to capture similarity of structure and syntax, and uses four BiLSTM layers with 64 hidden units. The feature vector of the input text is the average of the outputs at each time step, which is then fed into a 128-dimensional fully connected layer. Whether an entity is correctly normalized is judged by comparing the similarity between the entity and the normalized entity. The comparison method in this article is TF-IDF+Kmeans, and the experimental results are shown in Figure 3, where the main evaluation index is accuracy. Figure 3 shows that the accuracy of both methods tends to stabilize when the number of iterations approaches 100. The final accuracy of the method in this paper reaches 88.7%, an improvement over the TF-IDF+Kmeans method. The main reason may be that the cluster-based method is more applicable when the number of samples in each category is large but the number of categories is small, whereas the method in this paper is more suitable when the number of samples in each category is small but the number of categories is large.
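Paired training on the three-column data described above requires turning each row into labeled pairs. The sketch below shows one way to do so, assuming that each alias forms a positive pair with its own standardized name and a negative pair with a randomly chosen different name. The negative sampling scheme, the function name and the example rows (including the placeholder identifiers) are assumptions, since the text does not specify how the pairs are constructed.

```python
import random


def build_pairs(rows, negatives_per_row=1, seed=0):
    """Turn (standard_name, alias, identifier) rows into labeled training pairs.

    Each alias is paired with its own standard name (label 1) and with
    randomly drawn other standard names (label 0).
    """
    rng = random.Random(seed)
    names = sorted({name for name, _alias, _identifier in rows})
    pairs = []
    for name, alias, _identifier in rows:
        pairs.append((alias, name, 1))
        others = [n for n in names if n != name]
        k = min(negatives_per_row, len(others))
        for negative in rng.sample(others, k):
            pairs.append((alias, negative, 0))
    return pairs


# Hypothetical rows; the identifiers are placeholders, not real codes.
rows = [
    ("myocardial infarction", "heart attack", "ID-001"),
    ("myocardial infarction", "cardiac infarction", "ID-001"),
    ("hypertension", "high blood pressure", "ID-002"),
    ("diabetes mellitus", "diabetes", "ID-003"),
]
pairs = build_pairs(rows)
for alias, name, label in pairs:
    print(label, alias, "<->", name)
```

Each pair supplies one input to each branch of the Siamese network, and the label indicates whether the contrastive loss should pull the two representations together or push them apart.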

Experimental Effect
For the entity normalization of biomedical texts, the normalization effect of the BSEN method is shown in Table 3, which gives the normalization results and confidence for part of the data.

Conclusion
This paper proposes BSEN, a BiSiamese entity normalization method for biomedical text. Based on the BiSiamese network, the data set is trained in pairs, which handles well the case where the number of samples in each category is small but the number of categories is large, and captures the bidirectional semantic information of the text. The text similarity is then calculated by the BIC algorithm. The entity with the maximum similarity is taken as the normalized result of the input entity, and the result is finally checked for correctness. The experiments show that the method achieves an accuracy of 88.7%, higher than that of the TF-IDF+Kmeans method. Future work will consider expanding the data set, optimizing the parameters more thoroughly, and further improving accuracy.