Comparing intermittency and network measurements of words and their dependency on authorship

Many features from texts and languages can now be inferred from statistical analyses using concepts from complex networks and dynamical systems. In this paper we quantify how topological properties of word co-occurrence networks and intermittency (or burstiness) in word distribution depend on the style of authors. Our database contains 40 books from 8 authors who lived in the 19th and 20th centuries, for which the following network measurements were obtained: clustering coefficient, average shortest path lengths, and betweenness. We found that the two factors with stronger dependency on the authors were the skewness in the distribution of word intermittency and the average shortest paths. Other factors such as the betweeness and the Zipf's law exponent show only weak dependency on authorship. Also assessed was the contribution from each measurement to authorship recognition using three machine learning methods. The best performance was a ca. 65 % accuracy upon combining complex network and intermittency features with the nearest neighbor algorithm. From a detailed analysis of the interdependence of the various metrics it is concluded that the methods used here are complementary for providing short- and long-scale perspectives of texts, which are useful for applications such as identification of topical words and information retrieval.


Introduction
The application of ideas from statistical physics to text analysis has a long tradition since Shannon's usage of entropy as the central concept in information theory [1]. In recent years, physicists have proposed new approaches based on concepts from complex networks [2,3,4,5,6,7,8,9,10,11,12,13,14,15] and dynamical systems [16,17,18,19,20]. In the former, text is represented as complex networks with words (nodes) being connected (links) using procedures depending on their syntactic or semantic relationships [2]. Several of these networks share topological properties such as the scaling in the degree [3,4] and the small world feature [5,6]. The co-occurrence networks, where adjacent words are linked to each other, are probably the most popular for applications owing to their ability to capture important syntactic and semantic aspects of texts with a straightforward construction procedure. These networks were employed to evaluate writing quality [7] and machine translations [8,9], to generate and evaluate summaries [10], to construct spell checkers [11], to recognize patterns in poetry [12] and prose [13,14] and to study general properties of written language [15]. While cooccurrence networks focus mainly on short scales, an increasingly popular approach addresses longer text scales [16,17,18,21,22,23,24]. The usefulness of this latter approach stems from the finding that topical words are unevenly distributed along the text when compared to a random process or to function words. This observation can be quantitatively investigated using different analogies and measures familiar to the communities of statistical physics and dynamical systems, including level statistics [22,24], burstiness [17,18,19], entropy [21], and intermittency measures [20]. The author dependency on the features mentioned above has been noticed [14,17], but little work has been devoted to quantify the extent of this dependency and to test its usefulness to the automatic detection of authors.
In the field of authorship recognition (or stylometry), one tries to identify the author of documents whose identity is lacking [25]. Some simple quantitative proposals, such as the use of word length to distinguish between authors, go back to the mid 19th century (see Ref. [26] for a historical account). One important recent contribution was given by Mosteller and Wallace [27] who showed that the frequency of function words (such as "any", "from", "an", "there" and "upon") can be used to characterize the style of authors. This feature is so strong that even letter pair frequencies can provide a good distinguishability between authors [26]. Frequent words are also responsible for the success of the approach -proposed and investigated by physicists -that consist in quantifying the similarity between two books based on the distance between their word-frequency rankings [28,29,30]. More recently, new features have been proposed: word length, sentence length; frequency of punctuation marks and contractions; frequency of graphemes, collocations and words. A summary of these recent results is given in Ref. [31].
In this paper we investigate how the metrics of complex networks and intermittency -familiar to physicists -depend on the style of authors. We start quantifying the metrics for each word (Sec. 2). These are used in the definition of global features for each book that are tested according to their efficiency in algorithms of authorship classification (Sec. 3). Finally, we discuss the importance of each feature (Sec. 4). The primarily goal of this paper is not to improve state-of-the-art methods of automatic authorship recognition. Instead, we wish to estimate the dependency on authorsihp of the selected metrics. We evaluate our results using authorship classification tests because they provide a statistical rigorous method to quantify the importance of different features. Nevertheless, our study reveals interesting insights which are potentially useful in real applications and therefore we include a comparison to more traditional statistical natural language methods (that increase the efficiency from 62.5% to 90.0% correct attributions, see Sec. 4.2).
2 Statistical quantification of the role of words in texts 2.1 Database Our database contains 5 novels from each of 8 authors who lived between 1809 and 1975, which are available in an online repository (http://www.gutenberg.org/). The list of books is summarized in Sec. 1 of the Supplementary Information (SI). To avoid effects from the length of the texts, each book was limited to their first 18,200 tokens, which corresponds to the length of the shortest book. In the remainder of this section we illustrate our results using the book "The Adventures of Sally", by P. G. Wodehouse. The results for all books appear in (SI)-Sec. 3 and are discussed in Sec. 3 below.
Pre-processing of the text.
Before extracting complex networks and intermittency measurements from the texts, some preprocessing steps were applied. Initially, a pre-compiled list of stopwords including articles, prepositions and adverbs were removed from the text (see SI-Sec. 2). Previous work for recognizing authorship used the frequency of function words, but we decided not to use them in our study because we are interested in the interrelation between words with a pronounced semantic content. This procedure has been employed in many works (see e.g. Refs. [7,8,9,12,15]) and it is crucial to determine how these techniques depend on the writing style of each author. Next, a lemmatization step was applied to the remaining words using the MXPost part-of-speech Tagger based on the Ratnaparki's model [32]. Table 1 exemplifies the application of these pre-processing steps. With this standardization, we grouped together all words referring to a same concept, despite the differences in flexion.

Network measurements
Complex networks have been used to characterize different properties of languages [5,6,7,8,9,10,14,33]. Here we are interested in author-specific characteristics and therefore we adopt a network description based on word co-occurrence [5,7,8,9,14,33], where nodes are words and links are established between subsequent words. This procedure is illustrated in Fig. 1. The network is defined by a set V = {v 1 , v 2 , . . . v n } of vertices, and a set E of edges and is represented as a nonsymmetric weighted matrix W . By construction, W is a square matrix of size n, where n is the number of distinct words after the pre-processing step. The elements w ij of W indicate the strength Table 1: Example of the pre-processing steps applied to the texts. An extract (first column) obtained from the book "The Adventures of Sally", by Pelham Grenville Wodehouse is shown after the removal of the stopwords (second column) and after lemmatization (third column i.e. the number of times word v j appears immediately after word v i . Additionally, we used the non-weighted and undirected network corresponding to W , denoted by the matrix A whose elements a ij = 1 if the words represented by the vertices v i and v j appeared as neighbors at least once in the text. Otherwise, a ij = 0. We shall use the statistical properties of complex network measurements in the networks W and A. In this section we discuss the word-specific local measurements and in Sec. 3.1 we show how to connect them to obtain a global characterization of the network. The number of occurrences N i of each word i represented by node v i is where s in and s out are respectively the weight (or strength [34,35]) of the ingoing and outgoing edges of node v i 1 . Therefore, the degree of each node v i is the frequency of appearance of word v i and the degree distribution of W is proportional to the normalized frequency (f i = N i /N T , N T = i N i ) distribution of words, given by the Zipf's law [36,37]. Below we discuss three typical measurements: clustering coefficient, average shortest path length, and betweenness centrality.
Clustering Coefficient C.
The clustering coefficient (C) measures the probability that the neighbors of a given vertex v i are connected. This measurement has been widely employed in complex networks, e.g. to verify the presence of communities [38,39,40] and to distinguish random networks from other small world networks [5,41]. Traditionally, the clustering coefficient is defined without considering weights or directions as: which is equivalent to the fraction of the number of triangles among all possible triads of connected nodes, and therefore ranges from 0 to 1. With regard to the interpretation of this measure, Ferrer   Table 1; and (b) the global network obtained from the first 1, 000 associations of the same book.
i Cancho and Solé [5] found that the clustering coefficient of networks representing text was much larger than the one expected just by chance (i.e., the value expected for the corresponding random networks).
We singled out the words (and their neighbors) with the highest and lowest values of clustering coefficient in the book "The Adventures of Sally", by P. G. Wodehouse, which are shown in Table  2 for the frequency N = 5 2 . From the definition, one expects that words with highest C to have neighbors also connected to each other. This is the case of the words "sand" and "excitement". On the other hand, words whose neighbors are not related to each other at all display low values of C (e.g. there is no link between the neighbors of "full" or between the neighbors of "high"). Qualitatively, the clustering coefficient quantifies how words are connected to specific contexts. Indeed, the words "sand" and "excitement" tend to be more restricted to a specific context, while "full" tends to appear in a myriad of contexts. Therefore, it seems that the clustering coefficient can be useful to detect authorship by quantifying the tendency of using semantic-specific or generic words.

Average Shortest Path Length L.
A shortest path (or geodesic path) between two nodes is defined as the path whose sum of edge weights is minimum. We start defining d ij as the length of the shortest path between v i and v j (in this case A is employed). Then the average shortest (or geodesic) path length for v i (L i ) is the average shortest path to all other (n-1) nodes of the network: Words of the book "The Adventures of Sally" with the highest and lowest clustering coefficients (the average clustering C = 0.085), for words with N i = 5. The five words with C = 0 were randomly selected among the 18 words with N = 5 and C = 0. The words with the lowest L include the characters "Sally" (L = 2.35, N = 347) and "Fillmore" (L = 2.51, N = 138), in addition to high-frequency words, such as "say" (L = 2.45, N = 349), "good" (L = 2.46, N = 107) and "man" (L = 2.50, N = 193). As for the words with the highest L, we found: "white-clad" (L = 6.33, N = 1), "affability" (L = 6.31, N = 1), "whirl" (L = 5.89, N = 1), "jazz" (L = 5.87, N = 1), "war-aims" (L = 5.84, N = 1). Interestingly, all these 5 words appeared only once in the text, indicating that one of the reasons for a high L could be the low frequency N . However, L is not only a consequence of the frequency N of the words, as low frequency words can also take low values of L. This is illustrated in Table 3, which compares words with the same N but different L. The frequency has a limited influence on L, with a Pearson correlation Corr(L,N ) = -0.36 calculated over all words. Actually, the determining factor is the neighborhood of the word. To understand why this happens, consider the words "affability" and "repose". While the former has as neighbors the words "jaunty" (N = 1) and "white-clad" (N = 1), the latter has as neighbors the words "Sally" (N = 347) and night (N = 20). Therefore, one may infer that L actually quantifies the importance of a word according to its distance to the most frequent words. Since we removed stopwords, the shortest path may be thought of as quantifying the distance from a word to the core-content words of the book. Table 3: Comparing the average shortest path length L for words with the same frequency N of the book "The Adventures of Sally". For a given N , L may vary widely, which shows the dependency of L on the neighborhood connectivity.
Betweenness B is a measurement of centrality, with higher values being assigned to the nodes considered as the most relevant in terms of linking different words. In other words, with B one attempts to quantify the frequency of access of each node, assuming that a given target node in the network is reached from a specific source node via shortest paths. Betweenness is defined as follows. Let η i st be the number of distinct shortest paths between the source node v s and the target node v t that pass through the node v i . If g st is the total number of shortest paths between v s and v t , then B i is given by: In the context of text analysis, high frequency words tend to have high B. However, some words may play the role of articulation points by linking concepts related to distinct communities. To illustrate this, we show in Table 4 that words with similar N may take very different B. A comparison between the left and right columns suggests that words with high B connect concepts because of their probable appearance in various contexts. Therefore, similarly to the clustering coefficient C, the betweenness centrality B seems to quantify the variety of contexts in which a word can appear. Note, however, that B is based on a global connectivity pattern, in contrast to C. In the book "The Adventures of Sally", by P. G. Wodehouse, there are N T = 15, 173 words (tokens), 3, 657 different word types, and 716 words with N i ≥ 5. The 5 words with highest σ T /T are shown in the left part of the table. For comparison, in the right we show for each of these words another word with the closest frequency. Word

Intermittency measurements
The uneven distribution of words across different documents is an essential feature exploited in Statistical Natural Language Processing. For instance, by investigating words appearing over concentrated in specific documents (when compared to their overall frequency) one can detect keywords, topics, and authorship [27,42]. This is the basic idea of the term frequency -inverse document frequency (TF-IDF) and related measures that are also at the core of search engines [42]. However, there are numerous situations where the comparison to a general database is not available or is not interesting. For instance, when authorship has to be attributed without previous knowledge of texts written by the potential authors. Here we approach these problems by taking advantage of the finding that words are unevenly distributed not only across documents but also within them [16,17,18,21,22,23,24]. The quantification of the uneven distribution of words has been proposed based on measures commonly used by physicists [16,24]. Following Refs. [21,22,24], we use the statistics of recurrence times, a standard quantification of intermittency or burstiness in time series [18,19]. In texts, time is counted by the number of words and for each word i the recurrence time T j is defined as the number of words between two successive occurrences of i (the j and j + 1 occurrence) plus one. For instance, the recurrence times for the word "the" in the previous sentence are T 1 = 9 and T 2 = 7. A word that appears N i times in a text of size N T leads to a sequence of N T − 1 inter-occurrence times {T 1 , T 2 , ..., T N T −1 }. In order to incorporate also the time until the first T f and after the last T l occurrence of the word, we consider T N = T f + T l . In this case T = N T /N i , where the overline denotes average over the different T j 's. Note that the mean recurrence time T i gives no additional information than the frequency N i . The intermittency of the word appears in the variance of T j 's around T and can be quantified by in the text have I = 1 (in the limit of large N and small N i /N T ), intermittent words have I > 1, and words appearing in regular intervals have I < 1. We calculate the intermittency measure I i = σ T /T for all words with N i ≥ 5 in each of the books (filtered texts) described in Sec. 2.1. The words with N i < 5 were considered to lack statistics and were disregarded. In Table 5 we compare words with highest I = σ/T to words with similar frequency. It is clear that the most intermittent words (largest σ T /T ) are topical words (e.g., name of characters and locations), regardless of their frequency. Indeed, 15 out of the 16 most intermittent words. are directly connected to specific characters. A similar behavior is observed in all books of our database. Intermittency is therefore a good characterization of topical words that in turn plays an important role in the author-specific characteristic of the texts. The relationship between σ T /T and the function of the words has been investigated in detail in Refs. [16,17,18,21,22,24]. In the next section we explore the fact that these properties are also author specific [17].
3 Evaluating the author dependency

From properties of words to properties of books
In the previous section we introduced five quantities characterizing properties of words in the text: frequency (N ), average shortest path length (L), betweenness (B), clustering coefficient (C), and intermittency (I = {σ T /T }). The values of these quantities for all words in the books in our database can be found in SI-Sec. 3. We now analyze the global distribution of these measurements for all the words in a given book by plotting the empirical probability density function ρ(X) for the measurements X = {N, L, B, C, I}. Fig. 2 shows the results for one book, and similar distributions were obtained for the other books. The shortest path L, clustering C, and intermittency I have a well defined peak and width (akin to a Gaussian distribution), but the frequency N and betweenness B have broad tail distributions (as in power law distributions ρ(X) ∼ X −α ). The tail in N corresponds to the well-known Zipf's law, which also appears in B as expected from the large correlation between B and N (Corr(B,N ) = 0.95 in the book of Fig. 2). With the two different behaviors we propose two sets of measurements, one for X = {L, C, I} and another for X = {N, B}.
Our goal is to obtain quantities characterizing important features of these distributions to be used as global measurements of the books. The most natural choice is the average value X , where . . . ≡ 1 M M i=1 . . . corresponds to an average over the M different words. For the network measures L, C, I this corresponds to the average values over nodes, a quantity considered as characteristic of the network [7,12,14,43]. For X = {N, B}, the highly frequent words contribute strongly to X due to the long tails. To compensate for this effect, we consider also a modified average defined as X 2 ≡ log X for X = {N, B}. For X = {L, C, I} the opposite is true, i.e., X is dominated by the large number of low frequency words. Accordingly, we introduce a modified average as X 2 ∝ i X i log N i , i.e., a weighted average with weights proportional to the logarithm of the frequency. The quantities X and X 2 are expected to give a good account of "typical" values of X. However, in Sec. 2 we mentioned that important information is conveyed by words with large X, i.e., in the tails of the distributions shown in Fig. 2. In order to characterize the fat-tail distributions of X = {N, B}, we used the coefficient α X of a power-law fit to the tails of ρ(X) 3 . An additional motivation for using α N comes from the suggestion in Ref. [45] that it serves as a quantification of the style of texts. The large values of X = {L, C, I} were characterized by calculating the skewness of ρ(X), a measure of the asymmetry of the distribution. In summary, the three features we use for each of the five quantities X = {N, B, L, C, I} are: Average value: X for X = {N, B, L, C, I}.
Modified average: Right tail: These features are given in Fig. 2 for one book (see SI-Sec. 3 for all 40 books). Obviously, the choice of the quantities above is inevitably arbitrary. Our choice was intended to capture features of the distribution, rather than giving a parametric description of the full distribution. In particular, the power-law fit in Eq. (7) does not intend to fully describe the distributions, as apparent in Fig. 2(d,e).

Machine learning methods and evaluation
In order to quantify the ability of the features described above to distinguish between authors, we employ machine learning algorithms which induce classifiers from a training database. The robustness of our results is tested with three widely used algorithms based on different principles. The first is known as C4.5 [46], and generates decision trees based on the information gained by each feature; the second algorithm is the Naive Bayes [47], which is based on the Bayes theorem; and the third and simplest algorithm is the Nearest Neighbor [48], which classifies an unknown instance according to the nearest neighbor of that instance in a normalized space involving all features. For more details, see SI-Sec. 4.

Efficiency of the classification
We consider the problem of distinguishing between 8 authors, using five books to represent each author's style. More specifically, each book described in Sec. 2.1 was characterized by the set of 15 features discussed in Sec. 3.1 ( X , X 2 and γ(X) for X = {N, B, L, C, I}). The authorship assignment was performed using the algorithms in Sec. 3.2 applied to a training dataset independent of the test book using the cross validation methodology (see SI-Sec. 4). This technique ensures that the training and evaluation sets are different and it is equivalent to assigning the authorship of one book in an experiment where 4 books of 8 authors were used as a training dataset. The final output of the algorithms is the assignment of a specific author to each book tested, and the efficiency is quantified simply as the fraction of successful assignments. The results are summarized in Table 6 and indicate accuracy rates between 42.5% and 50.0% when all 15 features were used. These results were statistically significant by a large amount, confirming that these features successfully capture author specific characteristics. To further explore the   6: Accuracy rate achieved for the three machine learning algorithms using all 15 features and the best combination of these features. The accuracy is estimated based on 40 authorship assignments. The p-values correspond to the probability of getting by chance a higher or equal accuracy in one (all features) and in 2 15 = 32, 768 (best case) trials. The features included in the best cases can be found in SI-Tables S1-S4

Relative importance of different features
In evaluating the importance of the different features on the final results it is essential to identify their mutual dependency. We start from the list of all 2 15 = 32, 768 combinations of features ordered by decreasing accuracy (as shown in SI-Tables S1-S4). We wish to quantify when feature y appears in the top of this list. To this end, we count the fraction of the 2 14 feature combinations that include y with accuracy higher or equal to a threshold. The final Estimate is then given by the area-under-the curve of the ROC plot obtained by varying the threshold [49]. This procedure is equivalent to the Mann-Whitney U test [50]. The motivation for using this method is that it evaluates the importance of a specific feature by taking into account how it combines with the other features to improve the accuracy of the prediction. The method depends both on the prediction algorithm and on the other features.
The features were ranked based on the method described above. The results for the 3 prediction algorithms are given in the three first columns of Table 7. The three features appearing as the most prominent are N (average frequency), γ(I) (skewness of intermittency), and N 2 (average logarithmic frequency). In order to state the importance of features beyond specific algorithm it is important to quantify in which extent the results obtained for the three algorithms (first 3 columns) are consistent with each other. To this end we compute the Spearman's rank correlation and obtain the values 0.29 (p-value = 0.145), 0.49 (p-value= 0.032) and 0.67 (p-value= 0.003) for the pairs C4.5/kNN, C4.5/Bayes and kNN/Bayes, respectively. The p-values are computed under the null hypothesis that the rankings are independent. Altogether, the three p-values indicate that the three rankings are consistent with each other. This is a strong indication that our analysis goes beyond algorithm-specific results and indeed captures the influence from the features.
It is interesting to compare the results to evaluations taking into account each feature separately. This can be done either by considering the accuracy of the prediction using only the specific feature or by comparing the information gained by including the feature [51]. This last method has the advantage of being independent of the prediction algorithm. These results are shown in the 4 last Table 7: Ranking of features based on the accuracy rate of the classifiers, where 1 in the table means best, 2 second best and so on. The results for each classifier algorithm (C4.5, kNN and Bayes) are reported using different ranking procedures combined with multiple features (Mann-Whitney U test, columns 1-3, information gain (column 4) and accuracy using only one feature (columns 5-7)). The last column reports the Pearson correlation between each feature and the vocabulary size M (number of different words) calculated over the 40 books in our database. The features in the table are ordered according to the decreasing geometric mean of the ranks obtained in the 3 multiple features analysis (this ordering is the same achieved by considering for each feature the likelihood of reaching by chance a ranking as good as the one in each of the three ranking schemes). The areas under the curve in the multiple features analysis ranged between 56% and 69%.
Multiple  Table 8: List of the 2 most relevant combinations of features, as revealed by a Factorial Analysis for the C4.5, kNN and Bayes classifier algorithms. As expected from Table 7, γ(I) provides good results when used in conjunction with other features, such as N , N 2 γ(B) and I 2 .
C4.5 kNN Bayes 1st Combination γ(L), C and C 2 γ(I) and γ(B) N , N 2 and B 2nd Combination γ(I) and N γ(I) and N 2 I 2 and γ(I) columns of Table 7. Note that some features appearing as very important in the multiple features analysis are not informative when taken alone (e.g., the skewness of the intermittency γ(I)). On the other hand, features that are well ranked in the single feature analysis do not always appear among the most important features when multiple features are considered (e.g., the weighted average of the clustering C 2 ). These observations show the nontrivial mutual dependency of the features. To further explore this we performed a factorial analysis (see SI-Sec. 5) using the 12 most important features in Table 7, with the most important combinations being summarized in Table 8. As expected from Table 7, in fact γ(I) appears among the 2 best combinations of features in all three algorithms, which confirms that its effectiveness is correlated with its interdependence with other features.

Discussion and conclusions 4.1 Interpretation of the results
We are now in a position to use the word-specific analysis (Sec. 2) and the distribution (Sec. 3.1) of the quantities X = {N, B, L, C, I} to assess the importance of the different features X , X 2 , and γ(X) in Table 7: N frequency. This was the most efficient quantity for recognizing authorship with N and N 2 among the 3 most important features. Noting that N is proportional to the inverse number of distinct words M : one infers that the distinguishing feature between the authors captured by N is the different vocabulary sizes. The modified average N 2 = log N also captures this aspect, including the proportion of frequent and infrequent words. On the other hand, the poor performance of γ(N ) (= α in ρ(N ) ∼ N −α ) is a clear signature of the universal, author-independent, character of Zipf's law (at fixed book size [37]).
B betweenness. The average betweenness B was useful because of its strong correlation with the vocabulary size of the book M (last column in Table 7). In network terms, this corresponds to a linear relationship between B and network size (M , number of nodes) and can be understood by noting that the number of terms in the sum of definition of B i in Eq. (4) is proportional to M 2 , so that B is expected to scale linearly with M for a fixed book size. The fact that B 2 and γ(B) had a poor performance indicates that the number of words with high betweenness is not a relevant feature to distinguish between authors.
L shortest path. This was the network quantity with best performance. L quantifies the typical distance of words to the central hubs of the network (frequent words). The good performance points to a dependence on the style of the authors. The poorer performance of L 2 and γ(L) indicates that the style dependency in L is more prominent in the typical values than, respectively, in the frequent and large L words.
C clustering. The poor performance of all values related to this quantity suggests that authors have very little freedom in choosing the clustering of words co-occurrence networks. The last position in the ranking of γ(C) in Table 7 suggests that the fraction of words used in specific contexts (high C) is author independent. The two averages C 2 and C take similar values (as seen in Fig. 2 and in SI-Sec. 3, recall the restriction N i ≥ 5 used in Sec. 2.2). They perform well only when used alone, possibly because of their correlation to vocabulary size M .
I intermittency. Apart from the frequency, intermittency was the most important quantity in Table 7 with the skewness of the distribution γ(I) playing a prominent role. In view of the results in Sec. 2.3, γ(I) may be interpreted as the fraction of all words that are topical or "keyword like". The poor performance of I is not surprising since I is normalized by frequency (I ≡ σ T /T ) and therefore I ≈ 1 is expected. Indeed, from all 5 quantities the I−features have shown altogether the smallest absolute value of correlation with vocabulary size (Tab. 7), explaining why even γ(I) has a poor relative performance when used alone. Finally, I 2 performs better than I suggesting that frequent words are the more relevant ones.

Comparison with other prediction methods
Even if the main goal of this paper is to evaluate the importance of different factors, it is also useful to compare the accuracy of our results with other methods of authorship attribution. Uzuner and Katz [52] used a database of books similar to ours, produced by 8 authors. They used five sets of features, including simple statistics and more sophisticated syntactic analysis (Table 3 of Ref. [52]). Our best results (accuracy of 65%) is comparable to their second best case obtained using "syntactic elements of expression" (62%), being significantly worse only than their best result, achieved using function words (87%). In an extensive review, Grieve [31] reported accuracies obtained with a set of 34 features varying between 33% and 87% for the case of 5 authors, and between 18% and 80% for 10 authors (Table 9 in Ref. [31]). Our best results are above the median of their results achieved by using different features. Their best results again are based on the relative frequency of function words. These results are in accordance with the long tradition started by Mosteller and Wallace to use the frequency of function words to distinguish between authors [27].
In order to confirm this in our database, we implemented a series of prediction schemes using the frequency N i of frequent (mostly function) words. Differently from the approach described in this paper that used average and scaling properties as features, now the frequencies of specific words are used directly as input features of the prediction algorithms. We used only the kNN algorithm because the other algorithms did not provide good results when too many features were included. When the list of 70 stopwords from Table 2.5 of Ref. [27] was used, we obtained an accuracy of 62.5%, i.e., comparable to our best results. Following Ref. [31], we considered two other lists of words: all 1, 978 words that appear in at least one book of each of the authors, leading to accuracy of 90%; and all 209 words that appear at least once in every book in our database, leading to an accuracy of 82.5%. We recall that in order to concentrate on analysis that focus on words with pronounced semantic content instead of function words, we have deliberately excluded a list of stopwords that comprised 80% of the cases listed in Ref. [27]. Therefore, our best combination of features compares well to other methods which demand more sophisticated syntactic analysis of the text. Measurements of complex network and intermittency are indeed able to capture many of the author-dependent characteristics.
In order to illustrate how measures analyzed in this paper can be complementary to traditional methods we have performed a very simple experiment using as features the frequency and intermittency of the set of words composed by the five most frequent words in each book. The accuracy in classifying the books only by the frequency was 72.5 % and only by intermittency was 37.5 %. Although this last accuracy rate is not impressive, it is statistically significant (p = 2.2 10 −4 ) and shows that the intermittency values of specific words across distinct authors is different. The accuracy is increased to 80 % when both features were included. It remains to be shown in future works how our results can improve state of the art methods of authorship attribution.

Summary of conclusions
We have shown that the style of different authors leave fingerprints in very general statistical measures of texts based on the network of co-occurrence of words and on intermittency or burstiness of words. The statistically significant scores obtained in authorship attribution unequivocally show that the style dependence of these features can be used in practice. Regarding the prominence of the different features, we note that both the results and ranking of features may depend on the database, selected features and attribution algorithms. Accordingly, as emphasized in Ref. [31], different algorithms and features have to be tested in a given corpus before any real application of authorship attribution. However, the robustness of our results using three radically different attribution algorithms strongly suggests that the different features have importance that go beyond specific algorithms. Two features should be highlighted: (i) the skewness of the distribution of intermittent words γ(I), which is based on the long-scale distribution of words and detects the extent into which topical words (keywords, large I ≡ σ T /T ) were used in the book; and (ii) the mean shortest path of the word co-occurrence network L , which is based on the short-range connectivity of words and detects the typical distance of words to all other words. The different natures of these two quantities suggest a complementary role for capturing both short-and long-scale properties of the text, as well as typical and exceptional words.
Our focus in this paper was on the evaluation of the different features, rather than on maximizing the efficiency of the authorship attribution algorithms. This is apparent when comparing the best accuracy rates we achieved using our approach (62.5%) and using previous proposals (90.0%), as discussed in Sec. 4.2. A further limitation of approaches based on intermittency and networks is that they require large pieces of text. While the root of the success of previous methods rely on the observation that function words are a powerful tool to detect the style of authors [27,31,52], in the complex network and intermittency approaches used in this paper the focus is on the content words. In this sense the results we achieve can be thought as being complementary to the analysis using function words. More specifically, our results suggest that using γ(I) and L can improve authorship recognition techniques when used in combination with the many different features currently employed [31]. Finally, the successful application of these measurements to characterize the style of authors suggests that the quantities discussed here can be further explored in other linguistic tasks, an approach that has been limited to a few works (see e.g. [14,53]).