How Fast Do Distribution and Semantics of Polysemic Words Change?

Creation of large diachronic text corpora has triggered studies of the evolution of natural languages using quantitative methods over the last decade. For the first time, it became possible to raise questions about general patterns of lexical semantic change. In the work by W. Hamilton et al. (2016), a hypothesis was formulated about a more rapid change in the meanings of polysemic words. In this paper, we consider how the frequency of words influences variations of semantic metrics. We use explicit word vector representations and the Jensen-Shannon divergence as a measure of change in the distribution and semantics of a word. The obtained results show that changes in the frequency ratio of a word in its different meanings can themselves lead to a change in the word's distribution, even in the absence of any changes in its meaning. Therefore, the observations made by W. Hamilton et al. about the correlation between the rate of change of words and their local cluster coefficient in the semantic connection network (considered as a quantitative measure of polysemy) can be fully explained by the effect described in this article.


Introduction
Word meaning can change due to various factors reflecting changes that take place in a language and in society. Over the years, semantic changes have attracted scientists' attention and triggered the development of methodology for studying semantic change. A big leap in the study of semantic shifts occurred at the beginning of the 21st century, when extra-large text corpora began to be created and efficient semantic vector models were developed.
Creation of new research tools allowed one to investigate not only changes in meanings of individual words but also, for the first time, to set the task of revealing laws and regularities of lexical semantic change. Recently, a number of works have appeared in which various hypotheses about such regularities are put forward.
For example, the work [1] proposes the "law of prototypicality". According to this law, prototype words contained in semantic fields change more slowly than peripheral words of these fields. The work [2] considered the evolution of semantics in groups of synonyms and studied two competing trends (towards differentiation of synonyms and towards their convergence). Having analyzed the empirical data, the authors concluded that the trend towards the convergence of synonyms prevails.
Two laws are proposed in [3]: the "law of conformity" (frequent words change more slowly) and the "law of innovation" (polysemic words change faster than monosemic words of the same frequency).
An interesting result was obtained in [4]: it was found that verbs change faster than nouns. The same work also studied how the rate of change in the semantics of nouns and verbs varies over time. The authors argue that, despite some fluctuations over the past 100 years, it has not changed significantly.

Results of this kind, relating not to individual words but to the entire lexicon of a language, cannot be obtained by conventional methods without involving large text corpora and modern computer text processing technologies.
However, many of the results obtained in [4] and other earlier works were called into question in [5]. The authors of [5] point out the need to take word frequency into account when making comparisons, because almost all the employed metrics depend on word frequency. It is also noted that the use of methods that utilize one or another variant of dimensionality reduction (SVD, Skipgram models, etc.) can cause the appearance of artifacts during the analysis of semantic changes. The work [6] also obtained results partially contradicting the "law of prototypicality". Therefore, there is a need for further research in this area.
In this paper, we consider the law of innovation hypothesis formulated in the paper [3] cited above. To validate it, it is necessary to develop a methodology suitable for such a task. In [3] there is no discussion of linguistic aspects, and any change in the distribution of a word is identified with a change in semantics. However, a change in word distribution does not always indicate a change in meaning. For example, the word ride has several meanings, including 'to travel on'. In that meaning, one may say that you can ride a car, ride a bike, ride the bus, etc. The distributions are different; however, the meaning is the same, 'to travel on' [7].
There are some other questionable issues in the methodology used in [3]. The first one is that the local cluster coefficient is regarded as an indicator of polysemy, supported by only a few examples at the extreme values of the cluster coefficient. The second arguable issue is that all parts of speech are analyzed together, including a number of functional words that are regarded as polysemic because they have the highest values of the local cluster coefficient (yet, also, always, etc.).
In our work, we do not aim to resolve whether the semantics of polysemic words changes faster than that of monosemic ones. We consider the impact of changes in the frequencies with which such words are used in one or another of their meanings on quantitative metrics of semantic change. We show that such changes in word frequencies can lead to a change in the distribution of the word even in the absence of actual changes in its meaning.

Related works
To study semantic change, the distributional hypothesis is used, which states that the distribution of a word can be used to estimate its meaning due to the correlation between distributional similarity and meaning similarity [8,9,10]. There are different algorithms of distributional meaning acquisition. Early works mainly used representations based on co-occurrence vectors [11,12,13,14]. In [15], it was proposed to use vectors built from Pointwise Mutual Information (PMI). Various ways of reducing the dimension of vector representations were also considered, for example, involving the use of SVD [16,15]. An improved word embedding technique based on neural networks [17,18] was proposed in 2013 and gave new impetus to research. The state-of-the-art model of semantic representation of words, BERT (see, for example, [19,20]), was used in [21]. An overview of the current state of research on low-dimensional word embeddings can be found in [22,23,24,25]. At present, methods based on neural network vector models are the most widely used for semantic shift studies. However, simpler representations based on explicit word vectors are also employed to solve various linguistic problems, including semantic change.
Hamilton et al. [3] compared various methods and concluded that SVD and Skipgram models provide better results than PPMI vectors. On the other hand, it is noted in [5] that artifacts may appear during semantic change analysis if one or another variant of dimensionality reduction (SVD, Skipgram models, etc.) is used in the applied method.
The article [26] performed an analysis based on the UK Web Archive corpus with a size of approximately one terabyte. It is noted that the method based on the direct use of co-occurrence vectors requires fewer computing resources than word embeddings; applying the latter to extra-large corpora can cause difficulties. In addition, the direct use of co-occurrence vectors has some advantages. Firstly, the data obtained using it are easy to interpret. Secondly, it allows easy matching across different time intervals. The disadvantage of the direct use of co-occurrence vectors that is usually mentioned is the high dimension of the vectors.

Vector Representation of Words
To build the required vectors, we used word and word-combination frequency data extracted from the Google Books Ngram corpus [27,28]. The third version of the English (common) corpus was launched in 2020. It includes texts of 16.6 million books published within 1470-2019, which contain approximately 2000 billion words. The Google Books Ngram corpus provides frequency data for 1-, 2-, 3-, 4-, and 5-grams. Data on the frequency of 5-grams are used in some works to train neural network vector models for solving various problems. For example, Kulkarni et al. [29] use these data to study word meaning change. However, it should be mentioned that the corpus does not include n-grams whose total frequency over the entire period is less than 40. Thus, some rare words may lack 5-, 4-, or even 3-grams in the database. Therefore, in this paper, we use data on the frequencies of 2-grams, that is, pairs of consecutive words.
To represent the semantics of a word, we use explicit word vectors. For each word, we select all bigrams occurring in the corpus that include this word. The distribution of the word in a given time interval is quantitatively characterized by the vector of frequencies of all bigrams that include it. It is reasonable to normalize the resulting vectors to sum 1 because they represent the probability distribution of the word being used in the context of various combinations. As mentioned above, a number of works note that employing methods that utilize dimensionality reduction (for example, SVD) can cause the appearance of artifacts when quantifying changes in word distribution (see, for example, [5]). Therefore, no dimensionality reduction algorithms were used. In addition, since there is no need to compare vectors of different words with each other in our work, the dimension of the vectors is in most cases not very large. The Jensen-Shannon divergence [30] was used as a metric to quantify the difference between vectors describing a word in different time intervals.
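To make the construction concrete, the following minimal sketch (in Python) builds such a normalized bigram-context vector for a word over a decade and compares two decades with the Jensen-Shannon divergence. The layout of the counts dictionary and all names here are illustrative assumptions, not the authors' actual code.

import numpy as np

# counts[(word, context)][year] -> number of occurrences of that bigram in that year,
# as extracted from the Google Books Ngram 2-gram data (hypothetical layout).

def jensen_shannon_divergence(p, q):
    """JSD (in bits) between two probability vectors over the same context set."""
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def decade_jsd(counts, word, years1, years2):
    """Normalized context vectors of `word` for two decades and their JSD."""
    contexts = sorted({c for (w, c) in counts if w == word})
    def vector(years):
        v = np.array([sum(counts[(word, c)].get(y, 0) for y in years)
                      for c in contexts], dtype=float)
        return v / v.sum()   # normalize to sum 1, i.e. a probability distribution
    return jensen_shannon_divergence(vector(years1), vector(years2))

# Example: decade_jsd(counts, "ride", range(1990, 2000), range(2000, 2010))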

Modelling of Polysemic Words
Let us consider the following model. Assume that there is some word with two or more meanings. Let us suppose that the distribution of the word used in each of these meanings does not change with time, that is, the relative probabilities of the word being used in a particular context are unchanged for each individual meaning. However, the ratio of the frequencies of use of the word in its various meanings may change. In this case, the observed change in word co-occurrence can be significant, even though the word's semantics have not changed.
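The effect can be made concrete with a tiny numerical illustration (reusing the jensen_shannon_divergence function sketched above); the two component distributions are invented for illustration only.

import numpy as np

# Fixed context distributions of a word in its two meanings (illustrative numbers).
meaning_1 = np.array([0.7, 0.2, 0.1, 0.0])
meaning_2 = np.array([0.0, 0.1, 0.2, 0.7])

def mixture(alpha):
    """Observed context distribution when the word is used in meaning 1
    with probability alpha and in meaning 2 with probability 1 - alpha."""
    return alpha * meaning_1 + (1 - alpha) * meaning_2

# Only the meaning ratio changes (0.8 -> 0.5); both per-meaning distributions stay
# fixed, yet the observed distribution shifts and the JSD is clearly nonzero.
print(jensen_shannon_divergence(mixture(0.8), mixture(0.5)))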
To quantify this effect for polysemic words, a disambiguated corpus is needed. However, such corpora are usually of small size. As for the large Google Books Ngram corpus, the study of changes in the distribution of polysemic words is complicated because only the frequencies of words and word combinations are freely available, not the original texts of the books. An alternative approach involves the use of statistical modelling.
To quantify the effect, we modelled vectors of polysemic/homonymous words. To perform this, pairs of words were randomly selected from the nouns found in the corpus. Then, the two words of each pair were relabelled in the corpus as a single lexeme. Words obtained in this way we will call chimera words. We can extract statistics on the frequency of their use in the context of various word combinations and build a vector representing them. Thus, vectors of polysemic words with a known ratio of frequencies of use in different meanings were modelled.
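As a sketch (continuing the hypothetical counts layout used above), merging two nouns into a chimera word amounts to summing their per-year bigram counts under a single artificial lexeme:

from collections import defaultdict

def chimera_counts(counts, w1, w2):
    """Relabel the bigram statistics of the nouns w1 and w2 as one pseudo-lexeme."""
    chimera = w1 + "|" + w2                      # artificial name of the chimera word
    merged = defaultdict(lambda: defaultdict(int))
    for (w, c), per_year in counts.items():
        if w in (w1, w2):
            for year, n in per_year.items():
                merged[(chimera, c)][year] += n  # sum counts of both component words
    return merged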
Words for the modelling experiments were chosen as follows. Firstly, we considered only nouns. To select nouns, we used the POS tags available in the second and third versions of the Google Books Ngram corpus. The words tagged as nouns in at least 90% of cases were selected. Further, the most interesting case is the one in which the words that make up a chimera word are approximately comparable in frequency. Therefore, we divided the list of all nouns into classes depending on the total frequency of each word in 1800-2019. The class number for the $i$-th word is found as the integer part of the binary logarithm of the word frequency $f_i$ (that is, as $\lfloor \log_2 f_i \rfloor$). In each class, 200 words were randomly selected, which were then combined into 100 pairs.
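The class assignment and pairing described above can be sketched as follows (200 randomly chosen nouns per frequency class, combined into 100 pairs; names are illustrative):

import math
import random

def frequency_classes(noun_total_freq):
    """Class of a word = integer part of log2 of its total 1800-2019 frequency."""
    classes = {}
    for word, freq in noun_total_freq.items():
        classes.setdefault(int(math.floor(math.log2(freq))), []).append(word)
    return classes

def sample_pairs(words_in_class, n_pairs=100, seed=0):
    """Randomly select 2*n_pairs words of one class and combine them into pairs
    (assumes the class contains at least 2*n_pairs words)."""
    rng = random.Random(seed)
    chosen = rng.sample(words_in_class, 2 * n_pairs)
    return list(zip(chosen[:n_pairs], chosen[n_pairs:]))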

Use of Bootstrapping
Estimations of semantic differences, as a rule, have a significant variance. To obtain more reliable results, as well as to find confidence intervals for the obtained estimates, it was proposed in [31] to use bootstrapping. In accordance with this approach [31], to model the empirical frequencies of use of a word in various contexts, a certain time interval is selected during which the frequency distribution is considered unchanged. Next, for each context $c$, we extract from the corpus the frequency of use $f_{i,c}(t)$ of the $i$-th target word in this context for the different years $t$ of the selected interval. After normalizing the frequencies to the corpus size $N_t$, we obtain a series of relative frequencies for the $M$ years of the selected interval, $f_{i,c}(t_1)/N_{t_1}, \ldots, f_{i,c}(t_M)/N_{t_M}$. To build the frequency vector, we independently and randomly choose one of these numbers for each context $c$. The resulting vectors model the natural variability of the frequencies of word combinations from year to year.
Having obtained two sets of vectors generated in this way for the two compared time intervals, we calculate the chosen measure of difference (for example, the Jensen-Shannon divergence) between them pairwise. Next, we calculate the average of the pairwise JSD values and their standard deviation.
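A minimal sketch of this bootstrap, assuming the yearly relative frequencies of one word have already been arranged into a (contexts x years) array and reusing the jensen_shannon_divergence function sketched above:

import numpy as np

def bootstrap_vector(rel_freq, rng):
    """One bootstrap replicate: for every context (row), independently pick the
    relative frequency of a randomly chosen year (column), then renormalize."""
    n_contexts, n_years = rel_freq.shape
    picked = rel_freq[np.arange(n_contexts), rng.integers(0, n_years, n_contexts)]
    return picked / picked.sum()

def bootstrap_jsd(rel_freq_1, rel_freq_2, n_rep=100, seed=0):
    """Mean and standard deviation of pairwise JSD between bootstrap vectors of
    two compared decades (rows of both arrays must refer to the same contexts)."""
    rng = np.random.default_rng(seed)
    d = [jensen_shannon_divergence(bootstrap_vector(rel_freq_1, rng),
                                   bootstrap_vector(rel_freq_2, rng))
         for _ in range(n_rep)]
    return float(np.mean(d)), float(np.std(d))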
To solve our task, this algorithm needs to be slightly modified. The expression for the frequency $f_{i,c}(t)$ is the following:

$f_{i,c}(t) = N_t \, \nu_i(t) \, p_{c|i}(t),$    (1)

where $N_t$ is the corpus size in a year $t$, $\nu_i(t)$ is the relative frequency of the $i$-th target word, and $p_{c|i}(t)$ is the relative frequency (conditional probability) of the context $c$ for the $i$-th word. To generate the frequencies $f_{i,c}(t)$, we separately generate the series $\nu_i(t)$ and $p_{c|i}(t)$ and then calculate the frequencies using formula (1). Let $W = \{w_1, w_2\}$ be the set of words used to create a chimera word. To simulate the frequencies of use of the chimera word in context $c$, we use the expression

$f_{\mathrm{chim},c}(t) = N_t \sum_{i:\, w_i \in W} \nu_i(t) \, p_{c|i}(t).$    (2)

The frequencies modelled using formulas (1) and (2) can be fractional numbers; in this case, the resulting number is rounded down.
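The modified generation step for a chimera word can be sketched as follows; the array layout and the choice to pair the corpus size $N_t$ with the year drawn for $\nu_i$ are assumptions of this illustration, not details stated in the text.

def simulate_chimera_frequencies(N, nu, p_cond, rng):
    """Simulate absolute chimera-word frequencies in every context for one decade.
    N      : corpus sizes of the years of the decade, shape (n_years,)
    nu     : yearly relative frequencies of the two component words, shape (2, n_years)
    p_cond : yearly conditional context probabilities p_{c|i}(t),
             shape (2, n_contexts, n_years)
    """
    n_words, n_contexts, n_years = p_cond.shape
    f = np.zeros(n_contexts)
    for i in range(n_words):
        y = rng.integers(n_years)                    # year resampled for nu_i (and N_t here)
        y_ctx = rng.integers(0, n_years, n_contexts) # a year resampled per context for p_{c|i}
        f += N[y] * nu[i, y] * p_cond[i, np.arange(n_contexts), y_ctx]  # formula (1), summed over i
    return np.floor(f)                               # fractional values are rounded down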

Accounting for the impact of word frequency
It is known that for all types of word embeddings, semantic difference estimates strongly depend on word frequency [3,5,32,31,33]. As pointed out in [5], this effect must necessarily be taken into account when comparing the rates of semantic change of different words. In [3], an explicit parametric model (a power-law dependence) is used to eliminate the dependence of the compared quantities on frequency. In such a case, the correctness of the conclusions made in the work depends on the validity of the assumption put forward. Note that there is no theoretical argument in the literature in favor of the power-law dependence in this case, and the results of model experiments also do not confirm it (see [33]).
In [6], to eliminate the effects of frequency when comparing pairs of synonyms, it was proposed to use a reference group: words of similar frequency that belong to the same part of speech. In this work, we also use this approach. Suppose we are evaluating the difference between the vectors representing a word in two chosen time intervals. In this paper, we analyze changes from decade to decade, so consider an example where the first interval is 1990-1999 and the second is 2000-2009. We calculate the total frequencies of words for the first and second intervals taken together (that is, in our example, in 1990-2009) and sort the list of nouns in descending order of this total frequency. Next, for each target word or chimera word, the 100 nearest neighbours in the obtained list are selected (50 above and 50 below). The rate of change for the target word is then compared with the corresponding values for this group of words of similar frequency.
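A sketch of the reference-group selection under these conventions (the list of nouns and their total frequencies over the two compared decades are assumed to be available):

import bisect

def reference_group(nouns_by_freq, target_freq, k=50):
    """nouns_by_freq: list of (word, total frequency over both compared decades),
    sorted in descending order of frequency. Returns the k words just above and
    the k words just below the frequency of the target (or chimera) word."""
    keys = [-f for _, f in nouns_by_freq]          # ascending keys for bisect
    i = bisect.bisect_left(keys, -target_freq)     # position the target would occupy
    neighbours = nouns_by_freq[max(0, i - k):i] + nouns_by_freq[i:i + k]
    return [w for w, _ in neighbours]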
Quantitatively, the ratio of the changes in the target word to those in the words of the reference group can be characterized in two ways (both are sketched below):
• Calculate the percentage of words in the reference group for which the estimated value is less than that of the target word. A value of about 50 percent would be a sign that this word does not differ significantly from other words of similar frequency. A value significantly higher than 50 percent would indicate that the estimated value is much greater than for similar words.
• Calculate the median ratio of the estimated value for the target word to the same value for the words of the reference group.
Further, we use both methods.
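Both indicators can be computed directly from the JSD of the target word and the JSDs of its reference group; a minimal sketch (names are illustrative):

import numpy as np

def compare_to_reference(target_jsd, reference_jsds):
    """Percentage of reference words whose JSD is smaller than the target's,
    and the median ratio of the target JSD to the reference JSDs."""
    ref = np.asarray(reference_jsds, dtype=float)
    percent_smaller = 100.0 * np.mean(ref < target_jsd)
    median_ratio = float(np.median(target_jsd / ref))
    return percent_smaller, median_ratio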

Results
We use data from the Google Books Ngram corpus for 220 years (1800-2019). We divide this interval into 22 ten-year intervals: 1800-1809, 1810-1819, ..., 2010-2019. We consider two cases:
• Over the period under consideration, both the distribution of the word in each of its meanings and the relative frequencies of the word used in each of its meanings change. To model this case in accordance with formula (2), resampling of both the relative frequencies of the word combinations $p_{c|i}(t)$ and the relative frequencies of the words $\nu_i(t)$ is performed separately for each decade (Case 1).
• Over the period under consideration, the relative frequencies of the word used in each of its meanings change, but the distribution of the word in each of its meanings does not change. To model this case, the relative word frequencies $\nu_i(t)$ are independently resampled for each decade, whereas the relative frequencies of the word combinations $p_{c|i}(t)$ for both decades are randomly selected from the same set, namely the pooled empirical values of the two compared decades (Case 2).
Then, for each chimera word, the measure of difference (Jensen-Shannon divergence) between the vectors of the word in neighbouring decades is calculated.
Let us first compare the values of the measure of difference for these two cases with each other. For 100 chimera words (of each frequency class and each decade), we calculate the ratio of the JSD in Case 1 to the JSD in Case 2. Figure 1 shows the geometric mean values of this ratio for chimera words of various frequency classes in different decades. The dependence of the average ratio on time observed in the figure arises because the size of the Google Books Ngram corpus has been growing rapidly; accordingly, the absolute frequencies of words change from decade to decade.
Next, let us compare the difference measure values for chimera words with those for randomly selected words of similar frequency (see Section 3.4). For each chimera word and each word of the reference group, the measure of difference (Jensen-Shannon divergence) between the vectors of the word in neighbouring decades is calculated. First, we calculate the ratio of the JSD for the chimera word to the JSD for each of the words in the reference group. Figure 3A shows the geometric mean of this ratio (for Case 1) for words of each frequency class in different decades. The red curve shows the isoline of level 1. As can be seen, geometric mean ratios below 1 are found only for rare words in the first half of the 19th century, when the size of the corpus was relatively small. In all other cases, the ratio significantly exceeds 1. Also, for each chimera word, we calculate the percentage of reference words for which the JSD is less than that of the chimera word. Figure 3B shows the average percentage for words of each frequency class in different decades. The red line is the 50% level isoline.
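As an illustration of the two resampling regimes and of the averaging used in Figure 1, the following hedged sketch pools the conditional context probabilities of the two decades for Case 2 and computes the geometric mean of per-word JSD ratios (all names are illustrative):

import numpy as np

def pooled_context_probs(p_cond_dec1, p_cond_dec2):
    """Case 2: p_{c|i}(t) for both decades is resampled from the pooled empirical
    values of the two compared decades (concatenate along the year axis)."""
    return np.concatenate([p_cond_dec1, p_cond_dec2], axis=-1)

def geometric_mean_ratio(jsd_case1, jsd_case2):
    """Geometric mean of per-chimera-word ratios JSD(Case 1) / JSD(Case 2)."""
    r = np.asarray(jsd_case1, dtype=float) / np.asarray(jsd_case2, dtype=float)
    return float(np.exp(np.mean(np.log(r))))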
Figure 4A shows boxplots of the ratio of the JSD for chimera words to the JSD for words of the reference group, depending on the current frequency of the chimera words. The boxplots for Case 1 and Case 2 are shown in different styles. As can be seen, except for the rarest words (with current frequencies of the order of 10²), the JSD ratio is significantly higher than 1, both for Case 1 and Case 2. Similarly, Figure 4B shows boxplots of the percentage of cases in which the JSD for chimera words is greater than the JSD for reference words. Except for the region of the lowest frequencies, the percentage of such cases is above 50% for most of the chimera words, and the differences between Case 1 and Case 2 are relatively small.
The number of words in the reference group for which the JSD is lower than the JSD for the chimera word will be denoted by n, and the percentage of such words by p. If the null hypothesis that the chimera words do not differ from the reference words with respect to the rate of change is true, then n has a uniform distribution in the range from 0 to N, where N is the number of words in the reference group. Under this assumption, it is easy to calculate the standard deviation of p:

$\sigma_p = \frac{100\%}{N}\sqrt{\frac{N(N+2)}{12}}.$

The standard deviation $\sigma_{\langle p \rangle}$ for the value obtained by averaging over M chimera words will be equal to:

$\sigma_{\langle p \rangle} = \sigma_p / \sqrt{M}.$

For Figure 3B, these parameters are N = 100 and M = 100; in this case, $\sigma_{\langle p \rangle}$ is equal to 2.9%. As for Figure 4B, the diagram is built on the basis of a different number of samples in different frequency ranges, so the values of $\sigma_{\langle p \rangle}$ differ. The dependence of $\sigma_{\langle p \rangle}$ on frequency is shown in Figure 4C. For example, for frequencies of approximately 2.6·10⁵, the median p values are 84.1% and 80.6% for Case 1 and Case 2, respectively, and $\sigma_{\langle p \rangle}$ for this frequency range is 0.662%. Therefore, the statistical significance of the observed effect is beyond doubt. Thus, the calculations show that the distribution of chimera words changes much faster than that of words of the same part of speech with similar frequencies (in both Case 1 and Case 2).
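For reference, a few lines reproduce the quoted uncertainty under the stated null hypothesis (the expression for $\sigma_p$ is the reconstruction given above):

import math

def sigma_p(N):
    """Std of the percentage p for one chimera word when n is uniform on 0..N."""
    return 100.0 / N * math.sqrt(N * (N + 2) / 12.0)

def sigma_p_mean(N, M):
    """Std of p averaged over M chimera words."""
    return sigma_p(N) / math.sqrt(M)

# sigma_p_mean(100, 100) is about 2.9%, the value quoted for Figure 3B.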

Conclusion
The paper considers the impact of changes in the frequencies with which a polysemic word is used in its various meanings on quantitative metrics of semantic change. To perform numerical calculations, explicit word vectors were used to represent word semantics. Vectors representing polysemic words were simulated by synthesizing them from vectors corresponding to randomly selected pairs of nouns. For the fictitious chimera words obtained in this way, the change in distribution from decade to decade was quantified using the Jensen-Shannon divergence as a measure of difference.
The performed calculations show that the distribution of chimera words changes much faster than that of words of the same part of speech with similar frequencies. Moreover, we also modelled the situation in which the distribution of a word in each of its meanings does not change, but the ratios of the frequencies of use of the word in its different meanings do (Case 2). Interestingly, in this case the distribution of chimera words also changes much faster than that of words with similar frequencies. The above results show that, even for high-frequency words, up to 80% of the observed values of the measure of semantic difference (JSD) can be explained by a change in the ratio of frequencies of use of the word in its different meanings. All this shows that changes in the frequency ratio of a word in its different meanings can themselves lead to a change in the word's distribution, even in the absence of any changes in its meaning. Therefore, an uncritical interpretation of numerical indicators can lead to a false conclusion about changes in the semantics of a word.
In particular, the observations made in [3] about the correlation between the rate of change of words and their local cluster coefficient in the semantic connection network (considered as a quantitative measure of polysemy) can be fully explained by the effect described above. Accordingly, the thesis of [3] that more polysemous words have higher rates of semantic change (the "law of innovation") cannot be considered proven at the moment. To confirm or refute this thesis, it is necessary to carefully consider the formulation of the problem, both in terms of the criterion for selecting polysemic/monosemic words and in terms of how the rate of semantic change is assessed. Effects associated with changes in word frequencies should also be taken into account and excluded.
It seems that a similar effect can be observed for neural network vector representations (Skipgram and CBOW models): the vector representing a polysemic word changes when the ratio of the frequencies of its use in different meanings changes. However, for models like Skipgram or CBOW, it is more difficult to synthesize model vectors that simulate the case in which the relative frequencies of use of a word in its different meanings change while the distribution of the word in each meaning remains unchanged. This may be the subject of further research. It is also of interest to check whether a similar effect is observed with contextualized models (such as ELMo and BERT).

Figure 1. Geometric mean value of the ratio of the JSD in Case 1 to the JSD in Case 2 for chimera words of different frequency classes in different decades.

Figure 2. A) A scatterplot of the ratio of the JSD in Case 1 to the JSD in Case 2 for chimera words of various frequencies; B) boxplots of the same ratio for different frequency classes.

Figure 2 shows the distribution of the values of the ratio of the JSD in Case 1 to the JSD in Case 2 as a function of the current frequency of chimera words. By the current frequency of a word, we mean the frequency of this word within the time interval for which the comparison is performed. For example, if we analyze the change in the distribution of a word from the decade 1990-1999 to the decade 2000-2009, then the frequencies are calculated over the interval 1990-2009. In panel A, the data are presented as a scatterplot; in panel B, the same data are shown as boxplots. As can be seen, the typical values of the ratio over the entire frequency range from 10² to 10⁸ lie between 0.8 and 1. This means that even for high-frequency words, up to 80% of the observed values of the measure of semantic difference (JSD) can be explained by a change in the ratio of frequencies of use of the word in its different meanings.

Figure 3. A) Geometric mean of the ratio of the JSD for chimera words (Case 1) to the JSD for reference words for different frequency classes in different decades; B) Average percentage of cases in which the JSD for chimera words (Case 1) is higher than the JSD for reference words.

Figure 4. A) Boxplots of the ratio of the JSD for chimera words of various frequencies to the JSD for reference words; B) Boxplots of the percentage of cases in which the JSD for chimera words is greater than the JSD for reference words; C) Standard deviation of the mean percentage for words of different frequencies.