Evaluation of word embedding models used for diachronic semantic change analysis

In the last decade, the quantitative analysis of diachronic changes in language and of lexical semantic change has become the subject of active research. A significant role in this has been played by the development of new, effective word embedding techniques, whose usefulness has been demonstrated in a number of studies. Some of these studies have focused on analyzing the optimal type of word2vec model, the hyperparameters used for training, and evaluation techniques. In this research, we used the Corpus of Historical American English (COHA). The paper presents the results of multiple training runs and a comparison of word2vec models trained with different hyperparameter settings and used for lexical semantic change detection. In addition to traditional word similarity and analogical reasoning tests, we tested the models on an extended set of synonyms: more than 100,000 English synsets randomly selected from the WordNet database. We show that changing the word2vec model parameters (such as the embedding dimension, the context window size, the model type, the word discard rate, etc.) can significantly affect the resulting word embedding vector space and the detected lexical semantic changes. Additionally, the results strongly depend on properties of the corpus, such as the word frequency distribution.


Introduction
Word embedding (word2vec) models are now widely used in distributional semantics and for solving a variety of natural language processing tasks. Many different methods and parameters can be used to train word2vec models; therefore, specialized quality assessment tests are required to select the optimal model type and parameter settings. During training, the quality of a word2vec model can be quantitatively expressed by the value of the loss function; however, this value is not very informative. Most often, trained embedding models are evaluated on a specific downstream task, for example, part-of-speech tagging or named entity recognition [1].
As this type of assessment is time-consuming and difficult to interpret, distributional semantic models are often evaluated using an established set of simple tests, such as word similarity or analogy tests [2]. Such evaluation methods provide a quick way to assess the quality of a model.
Popular tests used to evaluate distributional semantic models focus only on a subset of the words in the trained model, evaluating selected word pairs. For example, WordSim-353 [3] tests the behavior of only 353 word pairs; therefore, the model performance cannot be reliably assessed. To better understand the semantic structure of distributional semantic models and the types of semantic relationships they encode, alternative datasets have been specially developed for evaluating such models [4]. Although these datasets provide greater coverage of semantic relationships between word pairs (including hypernymy, synonymy, antonymy, and meronymy), a more reliable way to estimate the impact of parameter variations on the training of distributional semantic models is still needed. Several extensive studies have compared a large number of configurations created by changing one or more parameters. Levy et al. [6] found that careful setting of hyperparameters can, in some cases, be more advantageous than adding more data to the training process. Hellrich and Hahn [7] used diachronic corpora to evaluate the reliability of word embedding neighborhoods. Their results showed that choosing the right parameters when training word2vec models improves performance on certain types of problems and can also influence the type of semantic information captured by the model. It is shown in [5] that the performance of word2vec models is influenced by various factors, including the text corpus, its preprocessing, the chosen architecture, and the hyperparameter set. However, the authors also noticed that the influence of some parameters is ambiguous and sometimes contradictory.
In addition to the model parameters and training settings, the random initialization of weights at the beginning of training can play a significant role in estimating the quality of models. Work [9] shows that repeated training of word2vec models allows one to estimate confidence intervals for analyzing the significance of changes in semantic distances between words.
This study proposes an extended methodology for estimating the performance of word embedding models used for lexical semantic change detection tasks. For such applications, word embedding models should assign similar vectors to words that are similar in meaning, because neighborhood analysis of word embeddings underlies the assessment of diachronic semantic changes in a language. Unlike the word similarity and analogy tests mentioned above, we estimate not only selected pairs of words but expand the sample using the WordNet database. The reason for extending the test set is that a sharp increase in the number of examples increases the statistical significance of the obtained results. Since the WordNet database contains sets of synsets without quantitative estimates of word proximity, we propose an approach based on rank statistics. We use the proposed approach to compare the performance of distributional semantic models employed in lexical semantic change detection tasks. We also consider to what extent certain parameters affect the performance of diachronic models. For comparison, we also perform standard analogy and similarity tests.

Datasets
The Corpus of Historical American English (COHA) [11] was used to test the proposed approach for further analysis of diachronic changes in the distribution of words. The COHA corpus is available as a regular full-text corpus grouped at decade-level granularity. An important advantage of the COHA corpus is that it contains carefully selected texts and includes approximately the same amount of text from different genres in each decade. However, many rare words are missing from COHA; therefore, it is useful only for the analysis of relatively common terms. Document genres (e.g., fiction, magazine, newspaper) were selected according to the Library of Congress categorization scheme. COHA contains over 400 million words in approximately 107 thousand documents published between the 1810s and 2010s.
In this study, we tested word2vec models trained with different parameters on a sample of COHA texts for each decade from 1920 to 2010 (9 decades in total); the total corpus contained 99.3K documents and over 250 million tokens. Vocabulary sizes of the preprocessed texts are distributed over time as follows: 1910: 41,778; 1920: 47,401; 1930: 48,475; 1940: 48,619; 1950: 51,648; 1960: 51,113; 1970: 52,178; 1980: 57,120; 1990: 61,924; 2020: 65,836. The vocabulary sizes of the sub-corpora for different decades differ significantly, so we can also expect the distributional semantic models for different decades to have different properties.
To expand the sample for word similarity tests, we used the WordNet database, a large electronic lexical database of the English language [12]. WordNet is a large semantic network: a graph in which words are connected to each other by labeled arcs representing semantic relationships. Individual words stand in lexical relations with other words, whereas concepts, which can be expressed by more than one word, stand in semantic-conceptual relations. Synonymy reflects a "many-to-one" mapping between word forms and concepts. WordNet groups synonyms into unordered sets called synsets [13]. Replacing one member of a synset with another does not change the meaning of the context, although in some contexts one synonym may be stylistically more suitable than another. A synset lexically expresses a concept. Examples of synsets are mail/post, hit/strike and small/little. Thus, synsets can also be used in word similarity and analogy tests when analyzing word2vec models.
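The paper does not describe the exact extraction tooling, so the following is only a minimal sketch of how synsets with at least two single-word members could be collected from WordNet, here via the NLTK interface (the collect_synsets helper and the filtering choices are illustrative assumptions, not the authors' implementation):

```python
# Sketch (assumption: NLTK's WordNet interface is used; requires
# nltk.download("wordnet") to have been run once).
from nltk.corpus import wordnet as wn

def collect_synsets(min_members=2):
    """Collect synsets as sorted lists of lower-cased single-token lemmas."""
    synsets = []
    for syn in wn.all_synsets():
        # Keep single-word lemmas only: multi-word expressions rarely
        # receive their own embedding vector.
        lemmas = {l.name().lower() for l in syn.lemmas() if "_" not in l.name()}
        if len(lemmas) >= min_members:
            synsets.append(sorted(lemmas))
    return synsets

if __name__ == "__main__":
    synsets = collect_synsets()
    print(len(synsets), "synsets,", sum(len(s) for s in synsets), "word forms")
```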
As a reference word2vec model, we used pre-trained word vectors trained on the English portion of the Common Crawl dataset using the fastText library [14]. fastText is a library for efficient learning of word representations and sentence classification. The reference model was trained using CBOW with position-weights, with dimension 300, character n-grams of length 5, a window of size 5, and 10 negative samples.
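As an illustration, the pre-trained Common Crawl vectors can be loaded with the fastText Python package roughly as follows (cc.en.300.bin is the file name used by the official fastText distribution; the download step assumes network access):

```python
# Sketch: loading the pre-trained English Common Crawl fastText model.
import fasttext
import fasttext.util

fasttext.util.download_model("en", if_exists="ignore")  # fetches cc.en.300.bin
ref_model = fasttext.load_model("cc.en.300.bin")

print(ref_model.get_dimension())                    # 300
print(ref_model.get_nearest_neighbors("mail")[:3])  # (similarity, word) pairs
```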

Quantitative evaluation methodology
In total, 117,791 synsets were extracted from the WordNet database, comprising 147,478 word forms. Given a trained vector representation, we can compute the cosine distance from a target word to all other words in the vocabulary. It is natural to expect that, for a good model, the synonyms of a word will be among its most similar words. We can quantify this by calculating the average rank of the synonyms in the sample of words ordered by increasing cosine distance:

$$ r_i = \frac{1}{M-1}\sum_{k=1,\,k\neq i}^{M} r_{k|i} $$

Here i is the index of the selected word from a synset containing M words, and r_{k|i} is the rank of the k-th word of the synset in the sample of words ordered by increasing cosine distance from the i-th word. It should be noted that the ranks are calculated over the entire sample of words that possess vectors.
Let us now average over all words of the synset:

$$ \langle r \rangle = \frac{1}{M}\sum_{i=1}^{M} r_i = \frac{1}{M(M-1)}\sum_{i=1}^{M}\sum_{k\neq i} r_{k|i} $$

Thus, we obtain a single number ⟨r⟩ per synset, which shows how well the model groups the synonyms: small values of ⟨r⟩ indicate that the synonyms are located close to each other in the vector space.
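A minimal sketch of this rank statistic is given below, assuming the word vectors are available as a dict mapping each vocabulary word to a unit-normalized numpy vector (the vectors argument and the average_synset_rank helper are illustrative, not the authors' implementation):

```python
# Sketch of the synset rank statistic <r>: for every word i of a synset,
# rank the whole vocabulary by cosine distance from i and average the ranks
# of the remaining synset members.
import numpy as np

def average_synset_rank(synset, vectors):
    """Return <r> for one synset; `vectors` maps word -> unit-normalized vector."""
    words = [w for w in synset if w in vectors]
    if len(words) < 2:
        return None
    vocab = list(vectors)
    mat = np.stack([vectors[w] for w in vocab])   # (V, d) matrix of unit vectors
    index = {w: i for i, w in enumerate(vocab)}
    ranks = []
    for wi in words:
        sims = mat @ vectors[wi]                  # cosine similarity to every word
        order = np.argsort(-sims)                 # closest words first
        rank_of = np.empty(len(vocab), dtype=int)
        rank_of[order] = np.arange(len(vocab))    # rank 0 is the word itself
        ranks.extend(rank_of[index[wk]] for wk in words if wk != wi)
    return float(np.mean(ranks))                  # average over all i and k != i
```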
If we need to compare two models A and B, we calculate the average ranks ⟨r^(A)⟩ and ⟨r^(B)⟩ for each synset using the vectors of the two models. Next, among the N synsets, we count the percentage p of cases in which

$$ \langle r^{(A)} \rangle < \langle r^{(B)} \rangle $$

A value of p close to 50% indicates that both models, on average, cope equally well with the task of grouping synonyms. A percentage close to 100% indicates a clear superiority of the first model over the second one, while values close to 0 indicate a clear superiority of the second model over the first one.
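Continuing the sketch above, the percentage p can be computed by counting, over all synsets evaluable in both models, how often model A yields the smaller average rank (the compare_models helper reuses the hypothetical average_synset_rank from the previous listing):

```python
# Sketch: share of synsets on which model A groups synonyms more tightly
# (smaller <r>) than model B.
def compare_models(synsets, vectors_a, vectors_b):
    wins, total = 0, 0
    for synset in synsets:
        ra = average_synset_rank(synset, vectors_a)
        rb = average_synset_rank(synset, vectors_b)
        if ra is None or rb is None:
            continue  # skip synsets not covered by both vocabularies
        total += 1
        if ra < rb:
            wins += 1
    return 100.0 * wins / total if total else float("nan")
```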

Results
To illustrate the proposed approach, consider the impact of parameters such as the architecture, window size, vector dimension, and number of epochs on the performance of the models. Variations of the studied parameters are presented in Table 1. When choosing the parameters and their values, we were guided by similar studies [5], [6].
We trained the word embedding models using the widely used fastText library [14] with the variable parameters from Table 1 and fixed parameters (minimum word count of 5, 5 negative samples, and a word discard factor of 10^-4). The most complete set of parameter variations was trained on the corpus of texts written between 2000 and 2010. Ten models were trained for each of the 28 parameter sets in order to separate the differences between models caused by the random initialization of weights from those caused by changing a specific parameter. Thus, the total number of trained models for the target text corpus was 280.
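For illustration, one such configuration could be trained with the fastText library roughly as follows (the corpus file path is a placeholder; the parameter names are the library's standard ones and correspond to Table 1):

```python
# Sketch: training a single configuration from Table 1 with fastText.
import fasttext

model = fasttext.train_unsupervised(
    "coha_2000s.txt",    # hypothetical path to the preprocessed decade corpus
    model="cbow",        # or "skipgram"
    dim=300,             # vector dimension
    ws=5,                # context window size
    epoch=40,            # number of training epochs
    minCount=5,          # minimum count of words
    neg=5,               # number of negative samples
    t=1e-4,              # factor determining the word discard rate
)
model.save_model("cbow_dim300_ws5_ep40.bin")
```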
To get a general idea of the quality of the resulting models, we conducted comparative tests on the WordSim-353 and SimLex-999 [16] sets, which are most often used for such problems. Fig. 1a shows the values of Spearman's rank correlation coefficient calculated between the series of similarity values from the datasets and the cosine distances between words. The X axis corresponds to the variations in vector dimension used when training the models (see Table 1). According to the results of these basic tests, increasing the model dimension improves the model performance on both test sets. At the same time, there is no significant difference between the Skip-gram and CBOW models on the SimLex-999 set; on the WordSim-353 set, the CBOW model shows slightly higher performance for vector dimensions above 100. Figure 1b visualizes summary statistics with box plots for the Skip-gram and CBOW models tested on the set of WordNet synsets and compared against the results obtained with the pre-trained fastText vectors (see Table 1). The figure also shows improved performance of both architectures with increasing dimension; however, the CBOW model shows an advantage for all values of the dimension parameter (the higher the values on the Y axis, the higher the performance of the model). As the dimensionality increases above 100, the performance of the CBOW models changes only slightly. It should be noted that, for all the performance tests considered, the effect of changing the vector dimension parameter is statistically significant against the background of the scatter of the estimated statistics caused by the random initialization of weights before training.

Fig. 2 visualizes the results for all models with variable context window size using the WordSim-353 and SimLex-999 datasets (Fig. 2a) and WordNet synsets (Fig. 2b). It can be seen that decreasing the size of the context window improves the performance of the model on all test sets under consideration. For simplicity, we present the results for the CBOW models, since Skip-gram has already shown lower efficiency on the tests from Fig. 1. Thus, the best performance is observed with a context window size of 5.
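The similarity-benchmark evaluation behind Fig. 1a can be sketched as follows, assuming a tab-separated file of word pairs with human similarity scores (the file format and the evaluate_similarity helper are simplifying assumptions; cosine similarity is used here, so a good model yields a positive correlation):

```python
# Sketch: Spearman's rank correlation between human similarity judgements
# and model cosine similarities, in the style of WordSim-353 / SimLex-999.
import numpy as np
from scipy.stats import spearmanr

def evaluate_similarity(pairs_file, model):
    """pairs_file: lines of "word1<TAB>word2<TAB>human_score"."""
    human, predicted = [], []
    with open(pairs_file, encoding="utf-8") as f:
        for line in f:
            w1, w2, score = line.strip().split("\t")
            v1, v2 = model.get_word_vector(w1), model.get_word_vector(w2)
            cos = float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
            human.append(float(score))
            predicted.append(cos)
    return spearmanr(human, predicted).correlation
```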

Table 1. Values of the variable parameters.

Parameter                 Values
Corpus                    -2010, 99.3K documents
Size of context window    5, 10, 20, 40, 80 (fixed dimension 300, 40 epochs)
Vector dimensions         5, 25, 50, 100, 200, 300, 400 (fixed window 5, 40 epochs)
Number of epochs          10, 20, 40, 80 (fixed dimension 300, window 5)
Other (fixed)             Minimum count of words 5; number of negative samples 5; factor to determine word discard rate 10^-4
To address the task of detecting diachronic changes in word semantics, the approach proposed in this work for assessing the performance of word embedding models had to be analyzed on the COHA text corpus across different decades. To do this, the COHA corpus texts from each decade of 1920-2000 (a total of 9 sets of texts) were used. Based on the previous experiments presented in Fig. 1 and 2, the following combinations of embedding model parameters were chosen for further calculations: number of epochs (10, 40 and 80), window size (5 and 10), embedding dimension (100 and 300), and model type (CBOW and Skip-gram). Each model was trained 5 times to assess the statistical significance of the results.
The result of comparing models trained with different numbers of epochs is shown in Figure 3. Here, 10 and 40 training epochs were compared (CBOW model, dimension 300, window 5). Fig. 3 shows that, for all decades, 40 training epochs improve the model performance according to the criterion presented in this work. When comparing the performance of models trained for 40 and 80 epochs, one can see no significant improvement from doubling the number of epochs; taking into account the relatively small size of the text corpora for each decade (the vocabulary included, on average, 50-70 thousand word forms), a large number of epochs could lead to overfitting of the models.
The result of comparing models trained with context windows of different sizes is shown in Figure 4. We compared window sizes of 5 and 10 (CBOW model, dimension 300, 40 epochs). Fig. 4 illustrates that, for all decades, the window size of 5 shows slightly higher performance, as was seen in Fig. 2. At the same time, the performance improvement is not significant, which is quite expected when using the CBOW architecture.

The result of comparing models of different architectures is shown in Figure 5. The architecture types CBOW and Skip-gram were compared (dimension 100, 40 epochs, window 5). It can be seen from Fig. 5 that, for all decades, the CBOW model shows better performance.
The result of comparing models with different embedding dimensions is shown in Figure 6; here, embedding dimensions of 100 and 300 were compared (CBOW model, window 5, 40 epochs). It can be seen from Figure 6 that a higher embedding dimension yields a slight performance improvement for most decades; however, the improvement is not significant, which can be explained by the relatively small amount of corpus text available for each decade, as in the case of the number of training epochs.

Figure 5. Evaluation results for all decades with variable model type: CBOW and Skip-gram (dimension 300, epochs 40, other parameters from Table 1).

Figure 6. Evaluation results for all decades with variable vector dimensions: 100 and 300 (CBOW model, epochs 40, other parameters from Table 1).
To conclude this section, let us consider how the discussed model parameters affect the task of lexical semantic change detection. The words gay, tape, and plane were chosen as examples because a number of works (see, for example, [15]) indicated that these words gained new, frequently used meanings in the 20th century. We found the neighborhoods of these words using embedding models trained on data from different decades with different sets of parameters. For each word, the ten most similar words in a given decade were found. Using the set of independently trained models for each decade, we then calculated the average percentage overlap of a word's neighborhoods across decades. The calculation results are shown in Figure 7. In the figure, the first line corresponds to the results of the diachronic analysis of the three words above using model A (dimension 100, window 5, 40 epochs, CBOW). The figure also shows the results obtained with model B (dimension 300, window 5, 40 epochs, CBOW), model C (dimension 300, window 10, 40 epochs, CBOW) and model D (dimension 300, window 5, 80 epochs, CBOW). Figure 7 shows that models A and D exhibit insufficiently contrasting changes in the distributional semantics of the presented words; we previously observed similar results when analyzing the performance of models on the expanded set of synsets (see Fig. 3 and Fig. 6). Model C, in comparison with model B, also loses its advantage when assessing changes in distributional semantics, which is confirmed by the results presented in Fig. 4.
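A minimal sketch of the neighborhood-overlap measure used for Fig. 7 is given below, assuming several independently trained fastText models per decade (the neighborhood_overlap helper is illustrative, not the authors' implementation):

```python
# Sketch: average percentage overlap of the ten nearest neighbors of a word
# between two decades, averaged over pairs of independently trained runs.
def neighborhood_overlap(word, models_decade1, models_decade2, k=10):
    overlaps = []
    for m1 in models_decade1:
        for m2 in models_decade2:
            n1 = {w for _, w in m1.get_nearest_neighbors(word, k)}
            n2 = {w for _, w in m2.get_nearest_neighbors(word, k)}
            overlaps.append(100.0 * len(n1 & n2) / k)
    return sum(overlaps) / len(overlaps)
```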
Additionally, Table 2 presents the dynamics of changes in the most similar word forms according to the results of the diachronic distribution analysis shown in Fig. 7 for model B. Table 2 and Fig. 7 show the estimated change point, that is, the decade during which the maximum change in meaning is observed: the 1970s for the word 'gay', the 1950s for 'tape', and the 1920s for 'plane'. The results obtained correlate qualitatively with the results of other authors, for example [15]; however, in our work, new meanings of words are revealed in much earlier decades. These differences can be explained by the difference in the corpora used ([15] used the diachronic Google Books Ngram corpus), as well as by a more careful selection of optimal parameters for the word embedding models based on the criterion proposed in this work. The dimensionality of the word embedding model can also be determined by the NLP problem under consideration. For example, tasks that require high semantic precision, such as sentiment analysis or machine translation, can benefit from higher-dimensional embedding models. However, simpler tasks such as named entity recognition or part-of-speech tagging may not require such high dimensionality [6].
The dimensionality of the word embedding model also affects the computational resources required to train and use the model. Higher-dimensional embeddings require more memory and computation; therefore, it is important to consider the available computing resources when deciding on the dimensionality of word embeddings.
When assessing the performance of the models during training, the per-epoch loss is not a direct measure of the performance of the model with respect to the external application task. It is only an indicator of whether further training helps solve the internal optimization problem. Sometimes the quality of a model can deteriorate as the number of epochs increases; this often indicates overfitting. That is, the (possibly oversized) model memorizes features of the (probably undersized) training data and thus becomes better at its internal optimization problem, but not at the external application problem [21]. For the COHA corpus, no significant advantage was observed between 40 and 80 training epochs, which may be explained by the relatively small corpus size available for each decade. Thus, all of the above shows that the criterion proposed in this work for assessing model performance provides consistent results and can be used to select optimal word embedding models for lexical semantic change detection tasks.

Conclusions
This paper proposes an extended methodology for evaluating the performance of word embedding models using an expanded set of synsets from the WordNet database and rank statistics. To analyze the efficiency of the described criterion, a set of word embedding models with varying parameters was trained on the COHA text corpus, including texts written between 1920 and 2009. We showed the individual impact of certain parameters when training word embedding models using widely used similarity and analogy tests, as well as our own methodology. The results were assessed taking into account the statistics of 10 models independently trained with the same parameters. This allowed us to exclude the influence of random factors associated with variations in the initial states of the models during training. The best model quality for the COHA corpus data was obtained with a vector dimension of 300 and a context window size of 5. The optimal number of epochs, considering the trade-off between quality and training time, was 40. It is shown that the proposed methodology for assessing the performance of word embedding models is more effective than benchmark similarity and analogy tests for analyzing semantic changes.

Figure 1. Evaluation results for all models with variable vector dimensions using the WordSim-353 and SimLex-999 datasets (a) and WordNet synsets (b). Panel (b) shows summary statistics as box plots for the CBOW and Skip-gram models: on each box, the central mark indicates the median, and the bottom and top edges of the box indicate the 25th and 75th percentiles, respectively; the whiskers extend to the most extreme data points excluding outliers, and the outliers are marked by '+'.

Figure 2. Evaluation results for all models with variable size of the context window using the WordSim-353 and SimLex-999 datasets (a) and WordNet synsets (b).