Scaling laws and fluctuations in the statistics of word frequencies

In this paper, we combine statistical analysis of written texts and simple stochastic models to explain the appearance of scaling laws in the statistics of word frequencies. The average vocabulary of an ensemble of fixed-length texts is known to scale sublinearly with the total number of words (Heaps' law). Analyzing the fluctuations around this average in three large databases (Google-ngram, English Wikipedia, and a collection of scientific articles), we find that the standard deviation scales linearly with the average (Taylor's law), in contrast to the prediction of decaying fluctuations obtained from simple sampling arguments. We explain both scaling laws (Heaps' and Taylor's) by modeling the usage of words as a Poisson process with a fat-tailed distribution of word frequencies (Zipf's law) and topic-dependent frequencies of individual words (as in topic models). Considering topical variations leads to quenched averages, turns the vocabulary size into a non-self-averaging quantity, and explains the empirical observations. For the numerous practical applications relying on estimates of vocabulary size, our results show that uncertainties remain large even for long texts. We show how to account for these uncertainties in measurements of the lexical richness of texts with different lengths.


Introduction
Fat-tailed distributions [1-3], allometric scaling [4,5], and fluctuation scaling [6-8] are the most prominent examples of scaling laws appearing in complex systems. The statistics of words in written texts provide some of the best-studied examples: they show a fat-tailed distribution of word frequencies (Zipf's law) [9] and a sublinear growth (as in allometric scalings) of the number of distinct words as a function of database size (Heaps' law) [10,11]. The connection between these two scalings has been known at least since Mandelbrot [12] and has been further investigated in recent years [13-15], especially for large databases [16], finite text sizes [17,18], and more general scaling distributions [19,20]. In this paper we report the existence of a third type of scaling in natural language: fluctuation scaling. It appears when investigating the fluctuations around Heaps' law, i.e., the variation of the vocabulary over different texts of the same size. We show that this scaling results from topical aspects of written text which are ignored in the usual connection between Zipf's and Heaps' law.
The importance of looking at the fluctuations around Heaps' law is that this law is widely used to predict the size of vocabularies [21], e.g., (i) to optimize the memory allocation in inverse-indexing algorithms [22]; (ii) to estimate the vocabulary of a language [23,24]; and (iii) to compare the vocabulary richness of documents with different lengths [25-27]. Beyond linguistic applications, scalings of the number of unique items as a function of database size similar to Heaps' law have been observed in other domains, e.g., the species-area relationship in ecology [28,29], collaborative tagging [30], network growth [31], and the statistics of chess moves [32]. These scaling laws have been analyzed from the general viewpoint of innovation dynamics [33] and sampling problems [34]. Our results allow for the quantification of uncertainties in the estimation of these scaling laws and lead to a rethinking of the statistical significance of previous findings.
We use as databases three different collections of texts: i) all articles of the English Wikipedia [35], ii) all articles published in the journal PlosOne [36], and iii) the Google-ngram database [23], a collection of books published in 1520-2008 (each year is treated as a separate document). See Appendix A for details on the data collection.
The manuscript is organized as follows. Section 2 reports our empirical findings, with a focus on the deviations from a Poisson null model. Section 3 shows how these deviations can be explained by including topicality, which plays the role of a quenched disorder and leads to a non-self-averaging process. The consequences of our findings for applications, e.g. vocabulary richness, are discussed in Sec. 4. Finally, Sec. 5 discusses the implications of our main results for other complex systems.

Empirical Scaling Laws
The most prominent case of scaling in the context of language is Zipf's law [9], which states that the frequency, F_r, of the r-th most frequent word scales as

F_r ∝ r^{−α},    (1)

where the frequency of word r is defined as the fraction of times it occurs in the whole database. Another well-studied example of scaling in language concerns the vocabulary growth known as Heaps' law [10,11], which states that the number of different words, N, scales sublinearly with the total number of words, M, i.e.

N(M) ∝ M^λ,    (2)

with 0 < λ < 1. As a third case, we consider here the problem of the vocabulary growth for an ensemble of texts, and study the scaling of fluctuations by looking at the relation between the standard deviation, σ_M = (V[N(M)])^{1/2}, and the mean value, μ_M = E[N(M)], computed over the ensemble of texts with the same text length M. In other systems, Taylor's law [6],

σ_M ∝ μ_M^β,    (3)

with 1/2 ≤ β ≤ 1, is typically observed [8].
The connection between scalings (1) and (2) (Zipf's and Heaps' law) can be revealed by assuming that the usage of each word r is governed by an independent Poisson process with a given frequency F_r. In this description, the number of different words, N, becomes a stochastic variable for which we can calculate the expectation value E[N(M)] and the variance V[N(M)] over the realizations of the Poisson process. The probability that the word with rank r appears at least once in a text of length M is 1 − e^{−M F_r} and therefore

E[N(M)] = Σ_r (1 − e^{−M F_r}),    (4)

V[N(M)] = Σ_r e^{−M F_r} (1 − e^{−M F_r}).    (5)

Assuming a power-law rank-frequency distribution, F_r = c r^{−α}, we recover a scaling in the vocabulary growth for M ≫ 1, i.e. E[N(M)] ∝ M^λ, with a simple relation between the scaling exponents: α = λ^{−1} [37].
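As a concrete illustration, the null-model predictions (4) and (5) can be evaluated numerically. The following minimal Python sketch computes E[N(M)] and V[N(M)] for a truncated Zipf distribution; the Zipf exponent and truncation rank are illustrative assumptions, not values fitted to our databases. The relative fluctuations it prints decay with M, which is precisely the behavior contradicted by the data below.

```python
# Poisson null model of Eqs. (4)-(5) for a truncated Zipf law F_r = c r^(-alpha).
# alpha and the truncation rank R are illustrative assumptions.
import numpy as np

alpha, R = 1.3, 1_000_000
r = np.arange(1, R + 1, dtype=float)
F = r ** (-alpha)
F /= F.sum()                                   # normalize: sum_r F_r = 1

def poisson_null(M):
    """Return E[N(M)], Eq. (4), and V[N(M)], Eq. (5)."""
    p_absent = np.exp(-M * F)                  # P(word r never occurs in M tokens)
    mean = np.sum(1.0 - p_absent)
    var = np.sum(p_absent * (1.0 - p_absent))  # sum of Bernoulli variances
    return mean, var

for M in (10**3, 10**4, 10**5):
    mu, var = poisson_null(M)
    print(f"M={M:>6}  E[N]={mu:10.1f}  std/mean={np.sqrt(var) / mu:.4f}")
```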
In Fig. 1 we show empirical data of real texts for the scaling relations (1)-(3) and compare them with the predictions of the Poisson null model, Eqs. (4, 5). The Poisson null model correctly elucidates the connection between the scaling exponents in Zipf's and Heaps' law, but it suffers from two severe drawbacks. First, it is of limited use for a quantitative prediction of the vocabulary size of individual articles, as it systematically overestimates its magnitude, see Fig. 1(b,e,h). Second, it dramatically underestimates the expected fluctuations of the vocabulary size, yielding a qualitatively different behavior in the fluctuation scaling: whereas the Poisson null model yields an exponent β ≈ 1/2, as expected from central-limit-theorem-like convergence [8], all three empirical datasets [Fig. 1(c,f,i)] exhibit a scaling with β ≈ 1. This implies that the relative fluctuations of N around its mean value μ_M for fixed M do not decrease with larger text size (the vocabulary growth, N(M), is a non-self-averaging quantity) and remain of the order of the expected value. Furthermore, we find that in all three databases the fluctuation scaling approximately gives a quantitative relation between μ_M and σ_M:

σ_M ≈ 0.1 μ_M.    (6)

Heaps' law can also be constructed by considering the vocabulary growth of a single text as a curve N(M) for M = 1, 2, …, M_max, where M_max is the length of the text. This construction was employed in Fig. 1(e,f) and leads to the same results reported above. In Fig. 1(f) we show that the anomalous fluctuation scaling in the vocabulary growth is preserved when shuffling the word order of individual texts. This illustrates that, in contrast to the usual explanations of fluctuation scaling in terms of long-range correlations in time series [8], the observed deviations from the Poisson null model are mainly due to fluctuations among different texts.
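The measurement behind Fig. 1(c,f,i) is conceptually simple. A sketch of one possible implementation is shown below; it uses logarithmic binning as a stand-in for the running-window procedure of Appendix A, and the bin settings are illustrative choices.

```python
# Estimate mu_M, sigma_M and the fluctuation-scaling exponent beta from one
# (M, N) pair per text; log-binning stands in for the running window of
# Appendix A and the bin parameters are illustrative.
import numpy as np

def fluctuation_scaling(M, N, bins_per_decade=4, min_texts=10):
    M, N = np.asarray(M, float), np.asarray(N, float)
    n_bins = int(np.log10(M.max() / M.min()) * bins_per_decade) + 1
    edges = np.logspace(np.log10(M.min()), np.log10(M.max()), n_bins + 1)
    mu, sigma = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sel = N[(M >= lo) & (M < hi)]
        if sel.size >= min_texts:              # require enough texts per bin
            mu.append(sel.mean())
            sigma.append(sel.std(ddof=1))
    mu, sigma = np.array(mu), np.array(sigma)
    beta = np.polyfit(np.log(mu), np.log(sigma), 1)[0]   # slope in log-log
    return mu, sigma, beta
```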
In the following, we argue that these observations can be accounted for by considering the topical aspects of written language, i.e. instead of treating word frequencies as fixed, we will consider them to be topic-dependent (F_r → F_r(topic)).

Figure 1. Scaling of Zipf's law (1), Heaps' law (2), and fluctuation scaling (3). Each row corresponds to one of the three databases used in our work (Google-ngram, Wikipedia, PlosOne). (a,d,g) Zipf's law: rank-frequency distribution F_r considering the full database (the double power-law nature of the curves is apparent [19]). (b,e,h) Heaps' law: the number of different words, N, as a function of text length, M, for each individual article in the corresponding database (black dots). (c,f,i) Fluctuation scaling: standard deviation, σ_M, as a function of the mean, μ_M, for the vocabulary N(M) conditioned on the text length M. Poisson (dark line) shows the expectation from the Poisson null model, Eqs. (4, 5), assuming the empirical rank-frequency distribution from (a,d,g), respectively. (Data: μ, σ) (pale line) shows the mean, μ_M, and standard deviation, σ_M, of the data N(M) within a running window in M (see Appendix A for details on the procedure). Additionally, (e,f) show the results (Data: μ, σ) when the word order of each individual article is shuffled (dash-dotted), illustrating that the results are not due to temporal correlations within the text. For comparison we show in (c,f,i) the scalings σ_M ∝ μ_M^{1/2} and σ_M ∝ μ_M (dotted lines).

Topicality
The frequency of an individual word varies significantly among different texts, meaning that its usage cannot be described by a single global frequency alone [38-40]. For example, consider the usage of the (topical) word "network" in all articles published in the journal PlosOne. It has an overall rank r* = 428 and a global frequency F_{r*} ≈ 2.9 × 10^{−4}, see Fig. 2(a). The local frequency obtained from each article separately varies over more than one decade, see Fig. 2(b).
One popular approach to account for the heterogeneity in the usage of single words is provided by topic models [41]. The basic idea is that the variability across different documents can be explained by the existence of a (smaller) number of topics. In the framework of a generative model one assumes i) that individual documents are composed of a mixture of topics (indexed by j = 1, …, T), represented by the probabilities P_doc(topic = j); and ii) that the frequency of each word is topic-dependent, i.e. F_r(topic = j), which together leads to a different effective frequency in each document,

F_{r,doc} = Σ_{j=1}^{T} P_doc(j) F_r(j).

One particularly popular variant of topic models is Latent Dirichlet Allocation (LDA) [42], which assumes that the topic composition P_doc(topic) of each document is drawn from a Dirichlet distribution such that only a few topics contribute to each document. Given a database of documents, LDA infers the topic-dependent frequencies, F_r(topic), by numerical maximization of the posterior likelihood of the generative model [43]. As an illustration, in Fig. 2(c) we show F_{r*}(topic) obtained using LDA for the word "network" in the PlosOne database. As expected for a meaningful topic model, we see that the conditional frequencies vary over many orders of magnitude and that the global frequency F_{r*} is governed by a few topics.
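For readers who want to reproduce this step, the snippet below sketches how such topic-dependent frequencies can be extracted with the Gensim implementation of LDA [43]. The two-document corpus and the number of topics are toy placeholders, not the settings used for our analysis.

```python
# Sketch: extracting topic-dependent frequencies F_r(topic) with Gensim's LDA.
# `documents` is a toy stand-in for a real corpus; T and all other settings
# are illustrative.
from gensim import corpora, models

documents = [["gene", "expression", "network"],
             ["network", "graph", "node", "edge"]]          # placeholder corpus
dictionary = corpora.Dictionary(documents)
bow = [dictionary.doc2bow(doc) for doc in documents]

T = 20
lda = models.LdaModel(bow, num_topics=T, id2word=dictionary)

# Each row of get_topics() is a distribution over the vocabulary,
# i.e. the conditional frequencies F_r(topic = j).
F_topic = lda.get_topics()                                  # shape (T, vocab size)
word_id = dictionary.token2id["network"]
print(F_topic[:, word_id])                                  # F_{r*}(topic), cf. Fig. 2(c)

# Effective frequency in one document: F_{r,doc} = sum_j P_doc(j) F_r(j)
theta = dict(lda.get_document_topics(bow[0], minimum_probability=0.0))
F_doc = sum(p * F_topic[j, word_id] for j, p in theta.items())
```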

General treatment
In this section we show how topicality can be included in the analysis of the vocabulary growth. The simplest approach is to assume again that the usage of each word is governed by an independent Poisson process, but now to treat the frequencies not as fixed quantities but as random variables that vary among texts.
In this setting, the random variable representing the vocabulary size, N, for a text of length M can be written as

N = Σ_r I[n_r(M, F_r^{(j)})],    (7)

in which n_r is the integer number of times the word r occurs in a Poisson process of length M with frequency F_r^{(j)}, and I[x] is an indicator-type function, i.e. I[x = 0] = 0 and I[x ≥ 1] = 1. The calculation of the expectation value now consists of two parts: i) the average over realizations i of the Poisson processes n_r^{(i)}(M, F_r^{(j)}) for a given realization j of the set of frequencies F_r^{(j)}; and ii) the average over all possible realizations j of the sets of frequencies F_r^{(j)}. In this framework, expectation values correspond to quenched averages (denoted by the subscript q),

E_q[N(M)] = ⟨⟨N⟩_i⟩_j = Σ_r (1 − ⟨e^{−M F_r^{(j)}}⟩_j),    (8)

where we used

⟨I[n_r^{(i)}(M, F_r^{(j)})]⟩_i = 1 − e^{−M F_r^{(j)}}.    (9)

The last equation follows from the probability of word r not occurring in a Poisson process of duration M with frequency F_r^{(j)}, as in Eq. (4). For simplicity, hereafter ⟨…⟩ ≡ ⟨…⟩_j (the average over realizations of the sets of frequencies F_r^{(j)}). Using the inequality between the arithmetic and the geometric mean, i.e.

⟨e^{−M F_r}⟩ ≥ e^{−M ⟨F_r⟩},    (10)

we obtain

E_q[N(M)] = Σ_r (1 − ⟨e^{−M F_r}⟩) ≤ Σ_r (1 − e^{−M ⟨F_r⟩}) ≡ E_a[N(M)].    (11)

The right-hand side corresponds to the result of the Poisson null model (with fixed F_r = ⟨F_r⟩), see Eq. (4), and can be interpreted as an annealed average (denoted by the subscript a). This implies that the heterogeneous dissemination of words across different texts leads to a reduction of the expected size of the vocabulary, in agreement with the first deviation from the Poisson null model reported in Fig. 1(b,e,h).
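The inequality (11) can be checked numerically by Monte Carlo. In the sketch below the frequency ensemble (Zipf means rescaled by Gamma-distributed factors) is an illustrative assumption chosen only for this demonstration; the section on specific ensembles builds the ensemble from data instead.

```python
# Monte Carlo sketch of the quenched average, Eq. (8), against the annealed
# one, Eq. (11).  The Gamma-rescaled Zipf ensemble is illustrative.
import numpy as np

rng = np.random.default_rng(0)
R, alpha, a = 100_000, 1.3, 0.5                 # all illustrative
Fbar = np.arange(1, R + 1, dtype=float) ** (-alpha)
Fbar /= Fbar.sum()                              # mean frequencies <F_r>

def quenched_vs_annealed(M, n_real=200):
    acc = np.zeros(R)
    for _ in range(n_real):                     # realizations j of F_r^(j)
        Fj = Fbar * rng.gamma(shape=a, scale=1.0 / a, size=R)  # E[F_r^(j)] = <F_r>
        acc += np.exp(-M * Fj)
    E_q = np.sum(1.0 - acc / n_real)            # Eq. (8)
    E_a = np.sum(1.0 - np.exp(-M * Fbar))       # Eq. (11) = Poisson null model
    return E_q, E_a

E_q, E_a = quenched_vs_annealed(10_000)
print(f"E_q[N] = {E_q:.0f} <= E_a[N] = {E_a:.0f}")   # inequality (11)
```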
For the quenched variance we obtain (see Appendix B)

V_q[N(M)] = E_q[N(M)²] − E_q[N(M)]²    (12)

= Σ_r (⟨e^{−M F_r}⟩ − ⟨e^{−M F_r}⟩²) + Σ_r Σ_{r'≠r} Cov[e^{−M F_r}, e^{−M F_{r'}}],    (13)

where Cov[e^{−M F_r}, e^{−M F_{r'}}] ≡ ⟨e^{−M F_r} e^{−M F_{r'}}⟩ − ⟨e^{−M F_r}⟩⟨e^{−M F_{r'}}⟩. Comparing with the Poisson case in Eq. (5), we see that the quenched average yields an additional term containing the correlations of different words. In general, this term does not vanish and is responsible for the anomalous fluctuation scaling with β = 1 observed in real text, explaining the second deviation from the Poisson null model reported in Fig. 1(c,f,i).
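The decomposition (13) is also easy to probe numerically: the total variance can be estimated by direct simulation, the per-word term from the estimated ⟨e^{−M F_r}⟩, and the cross-word covariance term as their difference. The ensemble below is again the purely illustrative Gamma-rescaled Zipf construction.

```python
# Sketch of the decomposition in Eq. (13): quenched variance of N split into
# per-word terms plus a cross-word covariance term (illustrative ensemble).
import numpy as np

rng = np.random.default_rng(1)
R, alpha, a, M = 50_000, 1.3, 0.5, 10_000
Fbar = np.arange(1, R + 1, dtype=float) ** (-alpha)
Fbar /= Fbar.sum()

Ns, q = [], np.zeros(R)
n_real = 500
for _ in range(n_real):
    Fj = Fbar * rng.gamma(shape=a, scale=1.0 / a, size=R)  # frequencies F_r^(j)
    q += np.exp(-M * Fj)
    Ns.append(np.sum(rng.poisson(M * Fj) > 0))             # N in one realization
q /= n_real                                                # <e^{-M F_r}>

diagonal = np.sum(q - q**2)          # first sum in Eq. (13)
total = np.var(Ns, ddof=1)           # V_q[N(M)] by direct simulation
print(f"V_q = {total:.0f}, diagonal = {diagonal:.0f}, "
      f"cross-word covariance = {total - diagonal:.0f}")
```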

Specific ensembles
In this section we compute the general results from Eqs. (8) and (13) for particular ensembles of frequencies F_r^{(j)} and compare them to the empirical results. To the best of our knowledge, a generally accepted parametric formulation of such an ensemble has so far not been justified by systematic statistical analysis, which is why we propose the two nonparametric approaches explained in the following.
In the first approach we construct the ensemble F_r^{(j)} directly from the collection of documents, i.e. the frequency F_r^{(j)} corresponds to the frequency of word r in document j, such that

⟨e^{−M F_r}⟩ = (1/D) Σ_{j=1}^{D} e^{−M F_r^{(j)}},    (14)

where D is the number of documents in the data, see Fig. 2(b).
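In code, Eq. (14) is a one-line average over a document-term matrix. The sketch below assumes a hypothetical `counts` array of raw per-document word counts; the name and layout are ours, for illustration only.

```python
# Eq. (14) from a hypothetical document-term matrix `counts`
# (rows = documents j, columns = words r, entries = raw counts).
import numpy as np

def quenched_absence(counts, M):
    F = counts / counts.sum(axis=1, keepdims=True)  # per-document frequencies F_r^(j)
    return np.exp(-M * F).mean(axis=0)              # Eq. (14): average over D documents

# E_q[N(M)] then follows from Eq. (8):
# E_q = np.sum(1.0 - quenched_absence(counts, M))
```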
In the second approach we construct the ensemble from the LDA topic model [42], in which F_r^{(j)} = F_r(topic = j) corresponds to the frequency of word r conditional on topic j = 1, …, T, see Fig. 2(c,d). In this particular formulation each document is assumed to consist of a composition of topics drawn from a Dirichlet distribution, such that we get for the quenched average

⟨e^{−M F_r}⟩ = ∫ dθ P_Dir(θ|α) e^{−M F_r(θ)},    (15)

in which θ = (θ_1, …, θ_T) are the probabilities of each topic, F_r(θ) = Σ_{j=1}^{T} θ_j F_r(topic = j), and the integral runs over the T-dimensional Dirichlet distribution P_Dir(θ|α) with concentration parameter α. We infer the F_r(topic) using Gensim [43] for LDA with T = 100 topics.
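The T-dimensional integral in Eq. (15) has no convenient closed form, but it is straightforward to estimate by sampling. A minimal sketch, assuming a hypothetical `F_topic` matrix of topic-conditional frequencies (e.g. the `lda.get_topics()` output of the earlier Gensim sketch) and an illustrative concentration parameter:

```python
# Monte Carlo evaluation of the Dirichlet integral in Eq. (15): sample topic
# mixtures theta and average e^{-M F_r(theta)}.
import numpy as np

def lda_quenched_absence(F_topic, M, alpha_dir=0.1, n_samples=1000, seed=0):
    rng = np.random.default_rng(seed)
    T = F_topic.shape[0]
    thetas = rng.dirichlet([alpha_dir] * T, size=n_samples)  # theta ~ P_Dir(theta|alpha)
    F_theta = thetas @ F_topic          # F_r(theta) = sum_j theta_j F_r(topic = j)
    return np.exp(-M * F_theta).mean(axis=0)                 # <e^{-M F_r}>, Eq. (15)
```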
The results from both approaches are compared to the PlosOne database in Fig. 3. We can see in Fig. 3(a) that both methods lead to a reduction in the mean number of different words. Whereas the direct ensemble, Eq. (14), almost perfectly matches the curve of the data, the LDA ensemble, Eq. (15), still overestimates the mean number of different words in the data. This is not surprising: due to the smaller number of topics (compared to the number of documents), it constitutes a much more coarse-grained description than the direct ensemble. Additionally, the LDA ensemble relies on a number of ad hoc assumptions, e.g. the Dirichlet distribution in Eq. (15) or the particular choice of parameters in the inference algorithm, which were not optimized here. More importantly, both methods correctly account for the anomalous fluctuation scaling with β = 1 observed in the real data, see Fig. 3(b), and even yield a similar proportionality factor in quantitative agreement with the data. The comparison of the individual contributions to the fluctuations, Eq. (13), shown in the inset of Fig. 3(b), shows that the anomalous fluctuation scaling is due to correlations in the co-occurrence of different words (contained in the term Cov[e^{−M F_r}, e^{−M F_{r'}}]). The deviation from the data for short texts, μ_M < 100, is due to finite-size effects, which are not captured by the Poisson assumption of word usage. To confirm this, we replaced the Poisson process by a multinomial process and obtained agreement with the empirical observations over the full range of almost four decades in μ_M, see Fig. 3(b).

Adding texts
In thermodynamic terms, Heaps' law (like other allometric scalings) implies that the vocabulary size is neither extensive nor intensive (N(M) < N(2M) < 2N(M), also for M → ∞). While this can be seen as a direct consequence of Zipf's law, our results show that Heaps' law also depends sensitively on the fluctuations of the frequencies of specific words across different documents. To illustrate this, consider the problem of doubling the size of a text of size M. This can be done either by simply extending the same text up to size 2M (denoted by M = 2 · M) or by concatenating another text of size M (denoted by M = 2 × M). The Poisson model (fixed frequencies or annealed average) predicts the same expected vocabulary for both procedures,

E_a[N(M = 2 · M)] = E_a[N(M = 2 × M)] = Σ_r (1 − e^{−2M ⟨F_r⟩}).    (16)

Taking fluctuations of individual frequencies among documents (quenched average) into account yields (see Appendix C for details):

E_q[N(M = 2 · M)] = Σ_r (1 − ⟨e^{−2M F_r}⟩),    (17)

E_q[N(M = 2 × M)] = Σ_r (1 − ⟨e^{−M F_r}⟩²).    (18)

Using Eq. (10) and the fact that ⟨x²⟩ ≥ ⟨x⟩², we obtain the following general result:

E_q[N(M = 2 · M)] ≤ E_q[N(M = 2 × M)] ≤ E_a[N(M = 2 × M)].    (19)

This is consistent with the intuition that the concatenation of different texts (e.g., on different topics) leads to a larger vocabulary than a single longer text. The calculations above remain true if the text is extended by a factor k (instead of 2), even for k → ∞.
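The chain of inequalities (19) is easy to verify numerically. The following sketch uses the same illustrative Gamma-rescaled Zipf ensemble as before; it is a demonstration under assumed parameters, not a computation on our databases.

```python
# Numeric check of the ordering in Eq. (19) under the illustrative Gamma
# ensemble: extending one text (2*M) gives fewer expected distinct words than
# concatenating two texts (2xM), which stays below the annealed prediction.
import numpy as np

rng = np.random.default_rng(2)
R, alpha, a, M = 20_000, 1.3, 0.5, 10_000
Fbar = np.arange(1, R + 1, dtype=float) ** (-alpha)
Fbar /= Fbar.sum()
Fj = Fbar * rng.gamma(shape=a, scale=1.0 / a, size=(200, R))  # 200 realizations

q1 = np.exp(-M * Fj).mean(axis=0)            # <e^{-M F_r}>
q2 = np.exp(-2 * M * Fj).mean(axis=0)        # <e^{-2M F_r}>
E_extend = np.sum(1 - q2)                    # Eq. (17): one text of length 2M
E_concat = np.sum(1 - q1**2)                 # Eq. (18): two texts of length M
E_anneal = np.sum(1 - np.exp(-2 * M * Fbar)) # Eq. (16)
print(E_extend <= E_concat <= E_anneal)      # Eq. (19)
```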
The fluctuations around the mean show a more interesting behavior, as revealed by repeating the computations above for the variance. We consider the case of k texts, each of length M, such that M = k × M, and focus on the terms containing correlations between different words, which were shown to be responsible for the anomalous fluctuation scaling (see Appendix C for details):

Σ_r Σ_{r'≠r} (⟨e^{−M F_r} e^{−M F_{r'}}⟩^k − ⟨e^{−M F_r}⟩^k ⟨e^{−M F_{r'}}⟩^k).    (20)

The individual terms can be written as

⟨e^{−M F_r} e^{−M F_{r'}}⟩^k − ⟨e^{−M F_r}⟩^k ⟨e^{−M F_{r'}}⟩^k = ⟨e^{−kM F̄_r^{(k)}} e^{−kM F̄_{r'}^{(k)}}⟩_{j_1,…,j_k} − ⟨e^{−kM F̄_r^{(k)}}⟩_{j_1,…,j_k} ⟨e^{−kM F̄_{r'}^{(k)}}⟩_{j_1,…,j_k},    (21)

in which ⟨•⟩_{j_1,…,j_k} denotes the averaging over the realizations (j_1, …, j_k) of the frequencies F_r^{(j_i)} and F̄_r^{(k)} = (1/k) Σ_{i=1}^{k} F_r^{(j_i)} is the k-sample average frequency based on the realizations (j_1, …, j_k). In the limit k → ∞ the law of large numbers yields

F̄_r^{(k)} → ⟨F_r⟩ for k → ∞,    (22)

so that the covariance terms in Eq. (20) vanish. This implies that for k ≫ 1 (adding many different texts) the fluctuations in the vocabulary across documents (and therefore the correlations between different words) vanish and normal fluctuation scaling (β = 1/2) is recovered. This prediction can be tested in data. Starting from a collection of documents, we have created a new collection by concatenating k randomly selected documents (each document is used once). We then computed for each concatenated document the number of distinct words N up to size M, for increasing M, and computed E[N(M)] and V[N(M)]. We observe a transition of the exponent β in the fluctuation scaling, Eq. (3), from β ≈ 1 to β ≈ 1/2 as k increases.
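The same transition can be reproduced in a toy simulation. The sketch below draws document-specific frequencies from the illustrative Gamma-rescaled Zipf ensemble and fits β for different k; the exact values of β depend on the assumed ensemble, but the drift from β near 1 towards 1/2 with growing k is the behavior predicted above.

```python
# Simulation sketch of the predicted transition beta ~ 1 -> 1/2 when
# concatenating k documents (illustrative ensemble and parameters).
import numpy as np

rng = np.random.default_rng(3)
R, alpha, a = 20_000, 1.3, 0.5
Fbar = np.arange(1, R + 1, dtype=float) ** (-alpha)
Fbar /= Fbar.sum()

def vocab(k, m):
    """N for one text built from k documents of m tokens each."""
    present = np.zeros(R, dtype=bool)
    for _ in range(k):
        Fj = Fbar * rng.gamma(shape=a, scale=1.0 / a, size=R)
        present |= rng.poisson(m * Fj) > 0
    return present.sum()

for k in (1, 16):
    mus, sigmas = [], []
    for m in (250, 500, 1000, 2000, 4000):      # per-document lengths
        Ns = [vocab(k, m) for _ in range(100)]
        mus.append(np.mean(Ns))
        sigmas.append(np.std(Ns, ddof=1))
    beta = np.polyfit(np.log(mus), np.log(sigmas), 1)[0]
    print(f"k = {k:>2}: beta ~ {beta:.2f}")     # near 1 for k=1, lower for large k
```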

Vocabulary Richness
When measuring vocabulary richness we want a measure that is robust to differences in text size. The traditional approach is to use Herdan's C, i.e. C = log N / log M [25-27]. While quite effective for rough estimations, this approach has several problems. One obvious problem is that it does not incorporate any deviations from the pure Heaps' law (e.g., the double scaling regime [19]). More seriously, it does not provide any estimate of the statistical significance or expected fluctuations of the measure. For instance, if two values are measured for different texts, one cannot determine whether one is significantly larger than the other. Our approach is to compare observations with the fluctuations expected from models in the spirit of Sec. 3.2.
The computation of statistical significance requires an estimate of the probability of finding N different words in a text of length M, P(N|M), which can be obtained from a given generative model (e.g., the one presented in Sec. 3.2). For a text with vocabulary N* and length M*, we compute the percentile P(N > N*|M*), which allows for a ranking of texts with different sizes such that the smaller the percentile, the richer the vocabulary. An estimate of the significance of the difference in vocabulary can then be obtained by comparing the respective percentiles.
For the sake of simplicity, we illustrate this general approach by approximating P(N|M) by a Gaussian distribution. In this case, the percentiles are determined by the mean, μ_M = E[N(M)], and the standard deviation, σ_M = (V[N(M)])^{1/2}, of an underlying null model in terms of the z-score

z_{N(M)} = (N(M) − μ_M) / σ_M,    (23)

which quantifies how much the measured value N(M) deviates from the expected value μ_M in units of standard deviations (z_{N(M)} follows a standard normal distribution: z_{N(M)} ∼ N(0, 1)). If we take into account our quantitative result on the fluctuation scaling of the vocabulary, Eq. (6), i.e. σ_M ≈ 0.1 μ_M, we can calculate the z-score of the observation N(M) as

z_{N(M)} = (N(M) − μ_M) / (0.1 μ_M),    (24)

in which we only need the expected vocabulary growth, μ_M, from a given generative model (e.g., Heaps' law with two scalings [19]). Thus, we obtain a measure with which we can answer the following questions concerning vocabulary richness: i) for a single observation N(M), the z-score z_{N(M)} assigns a value of lexical richness taking into account any deviation from the pure Heaps' law via μ_M; ii) given two observations N_1(M_1) and N_2(M_2), the respective z-scores z_{N_1(M_1)} and z_{N_2(M_2)} can be directly compared to assess which text has the higher lexical richness, independent of the difference in text lengths; and iii) we estimate the statistical significance of the difference in vocabulary by considering Δz := z_{N_1(M_1)} − z_{N_2(M_2)}, which is distributed according to Δz ∼ N(0, 2) since z ∼ N(0, 1). This implies that the difference in the vocabulary richness of two texts is statistically significant at the 95%-confidence level if |Δz| > 2.77, i.e. in this case there is at most a 5% chance that the observed difference originates from topic fluctuations. As a rule of thumb, for two texts of approximately the same length (N(M) ≈ μ_M), the relative difference in the vocabulary has to be larger than 27.7%.
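A worked example of this comparison, using a pure Heaps' law with illustrative parameters as the placeholder null model for μ_M:

```python
# z-score comparison, Eqs. (23)-(24), with the empirical fluctuation scaling
# sigma_M ~ 0.1 mu_M of Eq. (6).  mu_of_M is a placeholder null model; the
# parameters c and lam are illustrative assumptions.
def mu_of_M(M, c=2.5, lam=0.8):
    return c * M**lam                     # assumed E[N(M)]

def z_score(N, M):
    mu = mu_of_M(M)
    return (N - mu) / (0.1 * mu)          # Eq. (24)

z1 = z_score(N=4200, M=10_000)            # text 1
z2 = z_score(N=9000, M=30_000)            # text 2, three times longer
dz = z1 - z2                              # dz ~ Normal(0, 2) under the null
print(f"z1 = {z1:.2f}, z2 = {z2:.2f}, |dz| = {abs(dz):.2f}, "
      f"significant = {abs(dz) > 2.77}")  # 1.96 * sqrt(2) ~ 2.77
```

Despite the shorter text having the larger raw ratio N/M, neither text is significantly richer here: |Δz| stays well below the 2.77 threshold.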
We illustrate this approach for the vocabulary richness of Wikipedia articles. As a proxy for the true vocabulary richness, we measure how much the vocabulary of each article, N(M), exceeds the average vocabulary N_avg(M) of articles with the same text length M, empirically determined from all articles in the Wikipedia. In Fig. 4 we compare the measures of vocabulary richness according to Herdan's C, Fig. 4(a), and the z-score, Fig. 4(b,c). For the latter, we use Eq. (24) and calculate μ_M from Poisson word usage by fixing Zipf's law and assuming Gamma-distributed word frequencies, see Appendix D for details. We see in Fig. 4(a) that Herdan's C shows a strong bias towards assigning high values of C to shorter texts: following a line of constant C, we observe for M ≲ 10 articles with a vocabulary below average, while for M > 1000 we observe articles with a vocabulary above average. A similar (weaker) bias is observed in Fig. 4(b) for the calculation of the z-score in the case in which we consider deviations from the pure Heaps' law but treat the frequencies of individual words as fixed, i.e. ignoring topicality. The z-score calculations including topicality in Fig. 4(c) show that we obtain a measure of vocabulary richness which is approximately unbiased with respect to the text length M (contour lines are roughly horizontal). Furthermore, in contrast to the two other measures, we correctly assign the highest z-score to the article with the highest ratio N(M)/N_avg(M). Altogether, this implies that it is not only important to take into account deviations from the pure Heaps' law but that it is crucial to consider topicality in the form of a quenched average.

Discussion
In summary, we have used large text databases to investigate the scaling between vocabulary size N (number of different words) and database size M. Besides the usual analysis of the average vocabulary size (Heaps' law), we measured the standard deviation across different texts with the same length M. We found that the relative fluctuations (standard deviation divided by the mean) do not decay with M, in contrast to simple sampling processes. We explained this observation using a simple stochastic process (Poisson usage of words) in which we account for topical aspects of written text, i.e. the frequency of an individual word is not treated as fixed among different documents. This heterogeneous dissemination of words across different texts leads to a reduction of the expected size of the vocabulary and to an increase in the variance. We have further shown the implications of these findings by proposing a practical measure of vocabulary richness which allows for a comparison of the vocabulary of texts with different lengths, including the quantification of statistical significance.

Our finding of anomalous fluctuation scaling implies that the vocabulary is a non-self-averaging quantity, meaning that the vocabulary of a single text is not representative of the whole ensemble. Here we have emphasized that topicality can be responsible for this effect. While the existence of different topics is obvious for a collection of articles as broad in content as the Wikipedia, our analysis shows that we can apply the same reasoning to the Google-ngram data, in which the variation in the usage of words is measured at different points in time. This offers a new perspective on language change [44], in which the difference in the vocabulary of written texts from different years can be seen as a shift in the topical content over time. Similarly, other systematic fluctuations (e.g., across different authors or in the parameters of Zipf's law) can play a role similar to topicality.
Beyond linguistic applications, allometric scaling [4,5] and other sublinear scalings similar to Heaps' law [28-33] have been observed in different complex systems. Our results show the importance of studying fluctuations around these scalings and provide a theoretical framework for their analysis.
Appendix C. Adding texts

For the calculation of the expectation of N(M = M_1 + M_2)² we get higher-order terms for r ≠ r', involving products of the form

⟨I[n_r^{(i_1)}(M_1, F_r^{(j_1)})] I[n_{r'}^{(i_2)}(M_1, F_{r'}^{(j_1)})]⟩_{i_1,i_2,j_1,j_2} = ⟨(1 − e^{−M_1 F_r^{(j_1)}})(1 − e^{−M_1 F_{r'}^{(j_1)}})⟩_{j_1},

which do not factorize because both words share the same realization j_1. Generalizing to the concatenation of an arbitrary number of k texts can be treated in the very same way; here we only state the result for the case of adding k texts of equal length M, such that M = k × M:

E_q[N(M = k × M)] = Σ_r (1 − ⟨e^{−M F_r}⟩^k),    (C.10)

V_q[N(M = k × M)] = Σ_r (⟨e^{−M F_r}⟩^k − ⟨e^{−M F_r}⟩^{2k}) + Σ_r Σ_{r'≠r} (⟨e^{−M F_r} e^{−M F_{r'}}⟩^k − ⟨e^{−M F_r}⟩^k ⟨e^{−M F_{r'}}⟩^k).    (C.11)

Appendix D. Vocabulary Growth for Gamma-distributed frequency and a double power-law

Assuming a Gamma distribution for the frequency of a single word among different texts [38],

P_Γ(F_r = x; a, b) = (1/Γ(a)) b^{−a} x^{a−1} e^{−x/b},    (D.1)

we can calculate the quenched average

⟨e^{−M F_r}⟩ = ∫ dx P_Γ(F_r = x; a, b) e^{−Mx} = (1 + bM)^{−a}.    (D.2)

If we assume that the frequency distributions of all words share the same shape parameter a (e.g. a = 1 corresponds to an exponential distribution) and fix the mean of the distribution, ⟨F_r⟩ = ab, we get ⟨e^{−M F_r}⟩ = (1 + M⟨F_r⟩/a)^{−a}. Assuming a double power-law for the average rank-frequency distribution [19] with parameters γ and r̂, i.e. ⟨F_r⟩ = C r^{−1} for r ≤ r̂ and ⟨F_r⟩ = C r̂^{γ−1} r^{−γ} for r > r̂, where C = C(r̂, γ) is the normalization constant determined by imposing Σ_r ⟨F_r⟩ = 1, we can calculate the vocabulary growth according to Eq. (4), with e^{−M F_r} replaced by the quenched average (D.2), analytically in the continuum approximation,

E_q[N(M)] = ∫ dr [1 − (1 + M⟨F_r⟩/a)^{−a}],    (D.5)

where the vocabulary growth E_q[N(M)] is parametrized by γ, r̂, and a.

In the limit a → ∞ the Gamma distribution P_Γ(F_r = x; a, b) with given mean ⟨F_r⟩ = ab = const converges to a Gaussian with σ² = ⟨F_r⟩²/a → 0, and we recover the Poisson null model, Eqs. (4, 5), in which the individual frequencies F_r are fixed (annealed average).
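A numeric sketch of this Appendix D construction follows. The parameters γ = 1.77 and crossover rank r̂ = 7830 follow Fig. 4; the shape a and the vocabulary cutoff R are illustrative assumptions, and the sum approximates the continuum integral in (D.5).

```python
# Quenched vocabulary growth for Gamma-distributed frequencies combined with
# the double power-law rank-frequency distribution of [19].
import numpy as np

gamma_, r_hat, a, R = 1.77, 7830, 0.08, 2_000_000
r = np.arange(1, R + 1, dtype=float)
Fbar = np.where(r <= r_hat, 1.0 / r, r_hat**(gamma_ - 1.0) * r**(-gamma_))
Fbar /= Fbar.sum()                 # fixes the normalization constant C(r_hat, gamma)

def E_q(M):
    """E_q[N(M)] with the Gamma average <e^{-M F_r}> = (1 + M <F_r>/a)^(-a)."""
    return np.sum(1.0 - (1.0 + M * Fbar / a) ** (-a))

for M in (10**3, 10**5, 10**7):
    print(f"M = {M:>8}:  E_q[N] = {E_q(M):,.0f}")
```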

Figure 2. Variation of frequencies due to topicality in the PlosOne database. (a) Rank-frequency distribution considering the full data with the location of the word "network" (dotted line), F_{r*} ≈ 2.9 × 10^{−4} with r* = 428. (b) Distribution P(F_{r*}) of the local frequency F_{r*} obtained from each article separately for the word "network", with the global frequency from (a) (dotted) and the effective frequency F_{r,doc} = Σ_{t=1}^{T} P_doc(t) F_r(t) from (c,d) (solid). (c) Topic-dependent frequencies F_{r*}(topic) inferred from LDA with T = 20 topics for the word "network", with the global frequency from (a) for comparison (dotted). (d) One realization of the topic composition, P_doc(topic), of a single document drawn from a Dirichlet distribution.


Figure 3. Vocabulary growth for specific topic models. (a) Average vocabulary growth and (b) fluctuation scaling in the PlosOne database (Data) and in the calculations from Eqs. (8), (13) for the two topic models based on the measured frequencies in individual articles (Real Freq) and on LDA (LDA Freq), compare Eqs. (14), (15). The inset in (b) (same scale as the main figure) shows the individual contributions to the fluctuations in Eq. (13): Σ_r (⟨e^{−M F_r}⟩ − ⟨e^{−M F_r}⟩²) (dotted) and Σ_r Σ_{r'≠r} Cov[e^{−M F_r}, e^{−M F_{r'}}] (solid), illustrating that correlations among different words lead to anomalous fluctuation scaling. The lines for LDA Freq and Real Freq in (b) show the calculations of the corresponding topic models with the assumption of Poisson usage in the derivation of Eqs. (8), (13) replaced by multinomial drawing, showing that the deviations from the data for μ_M < 100 are due to finite-size effects. For comparison we show the results of the Poisson null model (Poisson), Eqs. (4), (5), which does not take topicality into account.

Figure 4. Measures of vocabulary richness. For 5000 randomly selected articles from the Wikipedia database (black dots), we compute the ratio between the number of different words N(M) and the average number of different words N_avg(M) (empirically determined from all articles with the same text length M). We compare the predictions of different measures of vocabulary richness (solid lines): (a) Herdan's C and (b,c) the z-score, Eq. (24), in which we calculate the expected null model, μ_M, according to Eq. (D.5) with parameters γ = 1.77, r̂ = 7830 [19], and a → ∞ (in b) or a = 0.08 (in c). The solid lines are contours corresponding to values of N(M) that yield the same measure of vocabulary richness, varying from rich (red: C = 0.98 and z = 4) to poor (purple: C = 0.8 and z = −4) vocabulary. The article with the richest vocabulary according to each measure is marked by × (red), showing that there is a clear bias towards shorter texts in (a) and (b).
