Musical genres: beating to the rhythms of different drums

Online music databases have increased significantly as a consequence of the rapid growth of the Internet and digital audio, requiring the development of faster and more efficient tools for music content analysis. Musical genres are widely used to organize music collections. In this paper, the problem of automatic single- and multi-label music genre classification is addressed by exploring rhythm-based features obtained from a complex network representation. A Markov model is built in order to analyse the temporal sequence of rhythmic notation events. Feature analysis is performed by using two multivariate statistical approaches: principal components analysis (unsupervised) and linear discriminant analysis (supervised). Similarly, two classifiers are applied in order to identify the category of rhythms: the parametric Bayesian classifier under the Gaussian hypothesis (supervised) and agglomerative hierarchical clustering (unsupervised). Quantitative results obtained by using the kappa coefficient and the obtained clusters corroborate the effectiveness of the proposed method.


Introduction
Musical databases have increased in number and size continuously, paving the way to large amounts of online music data, including discographies, biographies and lyrics. This happened mainly as a consequence of musical publishing being absorbed by the Internet, as well as the restoration of existing analogue archives and the advancements of web technologies. As a consequence, more and more reliable and faster tools for music content analysis, retrieval and description are required, catering for browsing, interactive access and music content-based queries. Even more promising, these tools, together with the respective online music databases, have opened new perspectives for basic investigations in the field of music.
Within this context, music genres provide particularly meaningful descriptors given that they have been extensively used for years to organize music collections. When a musical piece becomes associated with a genre, users can retrieve what they are searching for much faster. It is interesting to note that these new possibilities of research in music can complement what is known about the trajectories of music genres, their history and their dynamics [1]. In an ethnographic manner, music genres are particularly important because they express the general identity of the cultural foundations in which they are incorporated [2]. Music genres are part of a complex interplay of cultures, artists and market strategies to define associations between musicians and their works, making the organization of music collections easier [3]. Therefore, musical genres are of great interest because they can summarize shared characteristics of music pieces. As indicated in [4], music genre is probably the most common description of music content, and its classification represents an appealing topic in music information retrieval (MIR) research. Despite the consensus on some rhythm concepts, there is not a single representation of rhythm that is applicable across different applications, such as tempo and meter induction, beat tracking and the quantization of rhythm. Previous works have also analysed the relevance of rhythm descriptors by measuring their performance in genre classification experiments. It has been observed that many of these approaches lack comprehensiveness because of the relatively limited rhythm representations that have been adopted [3].
Nowadays, multi-label classification methods are increasingly required by applications such as text categorization [24], scene classification [25], protein classification [26] and music categorization in terms of emotion [27], among others. The possibility of multi-genre classification is particularly promising and probably closer to the human experience.
In the current study, an objective and systematic analysis of rhythm is provided according to single- and multi-label classifications. Our main motivation is to study similar and different characteristics of rhythms in terms of the occurrence of sequences of events obtained from rhythmic notations. First, the rhythm is extracted from MIDI databases and represented as graphs or networks [28,29]. More specifically, each type of note (regarding its duration) is represented as a node, while the sequences of notes define the links between the nodes. Transition probability matrices are extracted from these graphs and used to build a Markov model of the respective musical piece. Since they are capable of systematically modelling the dynamics and dependencies between elements and subelements [30], Markov models are frequently used in temporal pattern recognition applications, such as handwriting, speech and music [31]. Supervised and unsupervised approaches are then applied, which receive as input the properties of the transition matrices and produce as output the most likely genre. Supervised classification is performed with the Bayesian classifier. For the unsupervised approach, a taxonomy of rhythms is obtained through hierarchical clustering. The described methodology is applied to four genres: blues, bossa nova, reggae and rock, which are well-known genres representing different tendencies. A series of interesting findings are reported, including the ability of the proposed framework to correctly identify the musical genres of specific musical pieces from the respective rhythmic information. We complement the findings and results for the single-label classification with an approach to perform multi-label classification, assigning a piece to more than one genre when appropriate.
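The Markov-model step described above can be sketched as follows: a matrix of note-transition counts, obtained from the rhythm digraph, is normalized row-wise into a transition probability matrix. The counts below are hypothetical, standing in for those extracted from a MIDI sample.

```python
import numpy as np

# Hypothetical transition counts: counts[i, j] = number of times
# rhythm notation i was immediately followed by notation j.
counts = np.array([[2.0, 1.0, 1.0],
                   [0.0, 3.0, 1.0],
                   [1.0, 0.0, 1.0]])

# Normalize each row so it sums to one (rows with no outgoing
# transitions are left as zeros).
row_sums = counts.sum(axis=1, keepdims=True)
P = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
print(P.sum(axis=1))  # each row of P sums to 1 (a stochastic matrix)
```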
This paper is organized as follows: section 2 describes the methodology, including the classification methods; section 3 presents the obtained results and a discussion of them; and section 4 contains the concluding remarks and directions for future work.

Materials and methods
Some basic concepts of complex networks as well as the proposed methodology are presented in this section.

System representation by complex networks
A complex network is a graph exhibiting an intricate structure compared to regular and uniformly random structures. There are four main types of complex networks: weighted and unweighted digraphs and weighted and unweighted graphs. The operations of symmetry and thresholding can be used to transform a digraph into a graph and a weighted graph (or weighted digraph) into an unweighted one, respectively [32]. A weighted digraph (or weighted directed graph) G can be defined by the following elements:
• Vertices (or nodes). Each vertex is represented by an integer number i = 1, 2, . . . , N; N(G) is the vertex set of digraph G, and N indicates the total number of vertices (|N(G)|).
• Edges (or links). Each edge has the form (i, j), indicating a connection from vertex i to vertex j. The edge set of digraph G is represented by E(G), and M is the total number of edges.
• The mapping ω: E(G) → R, where R is the set of weight values. Each edge (i, j) has a weight ω(i, j) associated with it. This mapping does not exist in unweighted digraphs.
Undirected graphs (weighted or unweighted) are characterized by the fact that their edges have no orientation: an edge (i, j) in such a graph implies a connection both from vertex i to vertex j and from vertex j to vertex i. A weighted digraph can be represented in terms of its weight matrix W. Each element of W, w_ji, associates a weight with the connection from vertex i to vertex j. Table 1 summarizes some fundamental concepts about graphs and digraphs [32]:
• Adjacency. If a_ij ≠ 0, i is a predecessor of j and j is a successor of i; predecessors and successors are adjacent vertices.
• Neighbourhood. Represented by v(i), the set of vertices that are neighbours of vertex i.
• Vertex degree. Represented by k_i, it gives the number of edges connected to vertex i, computed as k_i = Σ_j a_ij = Σ_j a_ji. In digraphs there are two kinds of degree: the in-degree, k_i^in = Σ_j a_ji, indicating the number of incoming edges, and the out-degree, k_i^out = Σ_j a_ij, indicating the number of outgoing edges. The total degree is defined as k_i = k_i^in + k_i^out.
• Average degree. The average of k_i over all network vertices; it is the same for in- and out-degrees.
For weighted networks, a quantity called the strength of vertex i is used to express the total sum of weights associated with the node. More specifically, it corresponds to the sum of the weights of the respective incoming edges (in-strength, s_i^in = Σ_j w_ij) or outgoing edges (out-strength, s_i^out = Σ_j w_ji) of vertex i.
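The in- and out-strengths above reduce to row and column sums of the weight matrix. A minimal sketch with a hypothetical 3-vertex weighted digraph, using the convention from the text that W[i, j] holds the weight of the connection from vertex j to vertex i:

```python
import numpy as np

# Hypothetical weight matrix: W[i, j] = weight of the edge j -> i.
W = np.array([[0.0, 2.0, 0.0],
              [1.0, 0.0, 3.0],
              [4.0, 0.0, 0.0]])

s_in = W.sum(axis=1)   # in-strength: total weight of incoming edges of each vertex
s_out = W.sum(axis=0)  # out-strength: total weight of outgoing edges of each vertex

print(s_in)   # [2. 4. 4.]
print(s_out)  # [5. 2. 3.]
```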

Another interesting measurement of local connectivity is the clustering coefficient. This feature reflects the cyclic structure of networks, i.e. whether they have a tendency to form sets of densely connected vertices. For digraphs, one way to calculate the clustering coefficient is as follows: let m_i be the number of neighbours of vertex i and l_i be the number of connections between the neighbours of vertex i; the clustering coefficient is then obtained as cc(i) = l_i/[m_i(m_i − 1)].
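A minimal sketch of this directed clustering coefficient, on a hypothetical 4-vertex digraph:

```python
import numpy as np

def clustering_coefficient(A, i):
    """cc(i) = l_i / (m_i * (m_i - 1)), where m_i is the number of
    neighbours of vertex i and l_i the number of directed edges among
    those neighbours. A is a binary adjacency matrix, A[i, j] = 1
    meaning an edge i -> j."""
    # Neighbours of i: vertices connected to i in either direction
    neigh = np.where((A[i, :] + A[:, i]) > 0)[0]
    neigh = neigh[neigh != i]
    m = len(neigh)
    if m < 2:
        return 0.0
    sub = A[np.ix_(neigh, neigh)]   # subgraph induced by the neighbours
    l = int(sub.sum())              # directed edges between neighbours
    return l / (m * (m - 1))

# Hypothetical digraph: 0->1, 0->2, 1->2, 2->3, 3->0
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0]])
print(clustering_coefficient(A, 0))  # neighbours {1,2,3}, 2 edges among them -> 1/3
```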

Data description
In this work, four music genres were selected: blues, bossa nova, reggae and rock. These genres are well known and represent distinct major tendencies. Music samples belonging to these genres are available in many collections on the Internet, so it was possible to select 100 samples to represent each of them. These samples were downloaded in MIDI format. This event-like format contains instructions (such as notes, instruments, timbres and rhythms, among others) that are used by a synthesizer during the creation of new musical events [33]. The MIDI format can be considered a digital musical score in which the instruments are separated into voices. Each sample in the dataset belongs to a single genre, since, in general, most MIDI databases found on the Internet are still single labelled.
In order to edit and analyse the MIDI scores, we applied the software for music composition and notation called Sibelius (http://www.sibelius.com). For each sample, the voice related to the percussion was extracted. The percussion is inherently suitable to express the rhythm of a piece. Once the rhythm is extracted, it becomes possible to analyse all the involved elements. The MIDI Toolbox for Matlab was used in the present work [34]. This toolbox is free and contains functions to analyse and visualize MIDI files in the Matlab computing environment. When a MIDI file is read with this toolbox, a matrix representation of note events is created. The columns in this matrix refer to many types of information, such as: onset (in beats), duration (in beats), MIDI channel, MIDI pitch, velocity, onset (in seconds) and duration (in seconds). The rows refer to the individual note events, that is, each note is described in terms of its duration, pitch and so on.
Only the note duration (in beats) has been used in the current work. In fact, the durations of the notes, respecting the sequence in which they occur in the sample, are used to create a digraph. Each vertex of this digraph represents one possible rhythm notation, such as quarter note, half note, eighth note and so on. The edges reflect the subsequent pairs of notes. For example, if there is an edge from vertex i, represented by a quarter note, to a vertex j, represented by an eighth note, this means that a quarter note was followed by an eighth note at least once. The thicker the edge, the larger the weight of the connection between the two nodes. Examples of these digraphs are shown in figure 1. Figure 1(a) depicts a blues sample, the music How Blue Can You Get by BB King. A bossa nova sample, the music Fotografia by Tom Jobim, is illustrated in figure 1(b). Figure 1(c) illustrates a reggae sample, the music Is This Love by Bob Marley. Finally, figure 1(d) shows a rock sample, the music From Me to You by the Beatles.
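The digraph construction described above can be sketched as follows. The toy duration sequence and the beat values assigned to each notation are hypothetical; each distinct duration becomes a vertex and each consecutive pair of notes increments the weight of the corresponding edge.

```python
import numpy as np

# Hypothetical duration sequence in beats (e.g. quarter = 1, eighth = 0.5, half = 2)
durations = [1.0, 0.5, 0.5, 1.0, 0.5, 2.0, 1.0, 0.5]

notation = sorted(set(durations))           # one vertex per rhythm notation
index = {d: k for k, d in enumerate(notation)}
n = len(notation)

# W[i, j] counts how often notation j was immediately followed by notation i,
# matching the convention W[i, j] = weight of the connection j -> i.
W = np.zeros((n, n))
for prev, curr in zip(durations, durations[1:]):
    W[index[curr], index[prev]] += 1

print(W)
```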

Feature extraction
Extracting features is the first step of most pattern recognition systems. Each pattern is represented by its d features or attributes in terms of a vector in a d-dimensional space. In a discrimination problem, the goal is to choose features that allow the pattern vectors belonging to different classes to occupy compact and distinct regions in the feature space, maximizing class separability. After extracting significant features, any classification scheme may be used.
In the case of music, features may belong to the main dimensions of music, including melody, timbre, rhythm and harmony. One of the main goals of this work is to extract features from the digraphs and use them to analyse the complexities of the rhythms, as well as to perform classification tasks. For each sample, a digraph is created as described in the previous section. All digraphs have 18 nodes, corresponding to the number of distinct rhythm notations occurring across all samples after excluding those that hardly ever occur. This exclusion was important in order to provide an appropriate visual analysis and to better fit the features. In fact, avoiding features that do not significantly contribute to the analysis reduces data dimensionality, improves classification performance through a more stable representation and removes redundant or irrelevant information (in this case, it minimizes the occurrence of null values in the data matrix).
The features are associated with the weight matrix W. As commented on in section 2.1, each element of W, w_ij, indicates the weight of the connection from vertex j to vertex i or, in other words, how often the corresponding rhythm notations follow one another in the sample. The weight matrix W has 18 rows and 18 columns and is reshaped into a 1 × 324 feature vector. This is done for each genre sample. However, it was observed that some samples, even belonging to different genres, generated exactly the same weight matrix; these samples were excluded. Thereby, the feature matrix has 280 rows (all non-excluded samples) and 324 columns (the attributes).
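The flattening and duplicate-removal steps can be sketched as follows, with hypothetical random weight matrices standing in for those extracted from the samples:

```python
import numpy as np

# Hypothetical 18x18 weight matrices, one per sample, plus one duplicate
rng = np.random.default_rng(0)
weight_matrices = [rng.integers(0, 3, size=(18, 18)) for _ in range(5)]
weight_matrices.append(weight_matrices[0].copy())   # a duplicated sample

# Reshape each 18x18 matrix into a 1x324 feature vector
features = np.array([W.reshape(-1) for W in weight_matrices])  # shape (6, 324)

# Drop samples whose weight matrices coincide exactly
_, keep = np.unique(features, axis=0, return_index=True)
features = features[np.sort(keep)]
print(features.shape)  # (5, 324)
```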
An overview of the proposed methodology is illustrated in figure 2. After extracting the features, a standardization transformation is applied to guarantee that the new feature set has zero mean and unit standard deviation. This procedure can significantly improve the resulting classification. Once the normalized features are available, the structure of the extracted rhythms can be analysed by using two different approaches for feature analysis: principal components analysis (PCA) and linear discriminant analysis (LDA). We also compare two types of classification methods: the Bayesian classifier (supervised) and hierarchical clustering (unsupervised). PCA and LDA are described in section 2.4 and the classification methods in section 2.5.
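The standardization step amounts to a column-wise z-score transformation; a minimal sketch on a hypothetical feature matrix X:

```python
import numpy as np

# Hypothetical feature matrix: 3 samples, 2 features
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# Standardize each feature (column) to zero mean and unit standard deviation
Z = (X - X.mean(axis=0)) / X.std(axis=0)
print(Z.mean(axis=0))  # ~[0, 0]
print(Z.std(axis=0))   # [1, 1]
```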

Feature analysis and redundancy removal
Two techniques are widely used for feature analysis [7,35,36]: PCA and LDA. Basically, these approaches apply geometric transformations (rotations) to the feature space in order to generate new features as linear combinations of the original ones, aiming at dimensionality reduction (in the case of PCA) or at a projection that best separates the data (in the case of LDA). Figure 3 illustrates the basic principles underlying PCA and LDA.
The direction x (obtained with PCA) is the best one to represent the two classes with maximum overall dispersion. However, it can be observed that the densities projected along direction x overlap one another, making the two classes inseparable. In contrast, if direction y, obtained with LDA, is chosen, the classes can be easily separated. Therefore, the best directions for representation are not always the best choices for classification, reflecting the different objectives of PCA and LDA [37]. Appendices A.1 and A.2 give more details of these two techniques.
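The contrast above can be sketched on synthetic two-class data whose variance is concentrated along one axis while the classes separate along the other. The data and its parameters are illustrative; PCA picks the high-variance direction, LDA the discriminative one.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Two synthetic classes: large variance along the first axis,
# class separation along the second axis.
rng = np.random.default_rng(42)
cov = [[5.0, 0.0], [0.0, 0.2]]
class0 = rng.multivariate_normal([0.0, 0.0], cov, size=200)
class1 = rng.multivariate_normal([0.0, 1.5], cov, size=200)
X = np.vstack([class0, class1])
y = np.array([0] * 200 + [1] * 200)

pca_dir = PCA(n_components=1).fit(X).components_[0]      # best representation
lda = LinearDiscriminantAnalysis(n_components=1).fit(X, y)
lda_dir = lda.scalings_[:, 0]                            # best separation

print(np.abs(pca_dir))   # dominated by the first (high-variance) coordinate
print(np.abs(lda_dir))   # dominated by the second (discriminative) coordinate
```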

Classification methodology
Basically, to classify means to assign objects to classes or categories according to the properties they present. In this context, the objects are represented by attributes, or feature vectors. There are three main types of pattern classification tasks: imposed criteria, supervised classification and unsupervised classification. Imposed criteria represent the easiest situation in classification, since the classification criterion is clearly defined, generally by a specific practical problem. If the classes are known in advance, the classification is said to be supervised (or classification by example), since examples (the training set) are usually available for each class. Generally, supervised classification involves two stages: learning, in which the features are tested on the training set; and application, when new entities are presented to the trained system. There are many approaches to supervised classification. The current study applies the Bayesian classifier through discriminant functions (appendix B). The Bayesian classifier is based on Bayesian decision theory and combines class-conditional probability densities (likelihood) and prior probabilities (prior knowledge), assigning each object to the class with the maximum a posteriori probability.
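The Gaussian Bayesian classifier described above can be sketched with scikit-learn, whose discriminant-analysis estimators implement the maximum a posteriori rule under the Gaussian hypothesis: the quadratic variant fits one covariance matrix per class, the linear one pools a single matrix. The data here is synthetic, standing in for the rhythm feature vectors.

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

# Synthetic two-class data in 4 dimensions
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0.0, 1.0, (60, 4)),
               rng.normal(2.0, 1.0, (60, 4))])
y = np.array([0] * 60 + [1] * 60)

quadratic = QuadraticDiscriminantAnalysis().fit(X, y)  # one covariance per class
linear = LinearDiscriminantAnalysis().fit(X, y)        # one pooled covariance

# Maximum a posteriori decision: pick the class with the highest posterior
posteriors = quadratic.predict_proba(X[:2])
print(posteriors.argmax(axis=1))  # predicted classes for the first two samples
```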
In unsupervised classification, the classes are not known in advance and there is no training set. This type of classification is usually called clustering, in which objects are agglomerated according to some similarity criterion. The basic principle is to form classes, or clusters, so that the similarity between objects within each class is maximized and the similarity between objects in different classes is minimized. There are two types of clustering: partitional (also called non-hierarchical) and hierarchical. In the former, a fixed number of clusters is obtained as a single partition of the feature space. Hierarchical clustering procedures are more commonly used because they are usually simpler [6,7,36,38]. The main difference is that instead of one definite partition, a series of partitions is produced progressively. Agglomerative (bottom-up) hierarchical clustering starts with N objects as N clusters and successively merges them until all objects are joined into a single cluster (please refer to appendix C for more details). Divisive (top-down) hierarchical clustering starts with all objects in a single cluster and splits it into progressively finer subclusters.
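Agglomerative clustering can be sketched with SciPy: the merge history forms the dendrogram, which is then cut at a chosen number of clusters. The toy 2-D data is illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two well-separated synthetic groups of 20 points each
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)),
               rng.normal(5.0, 0.3, (20, 2))])

Z = linkage(X, method='ward')                    # bottom-up merge history
labels = fcluster(Z, t=2, criterion='maxclust')  # cut the dendrogram at 2 clusters
print(np.unique(labels))  # [1 2]
```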
Because of the difficulty of defining a clear music genre taxonomy and considering the critical issues regarding redundancies and fuzzy boundaries discussed in the introduction, we also attempt to classify the pieces using a multi-label classification approach. For the supervised classification, instead of assigning an object only to the class with the maximum a posteriori probability, we proceeded as follows: 1. First, we normalized the a posteriori probabilities of each class, making their sum equal to one. Then, they were sorted in decreasing order. 2. After that, given a sample, we computed the distance between its sorted a posteriori probabilities and each of the reference vectors corresponding to one, two, three and four labels, assigning the sample the number of labels associated with the nearest vector. For a sample whose largest normalized probability is close to one, the smallest distance will certainly be to the one-label vector, indicating that the sample should be classified into a single genre. Now consider sample 5 from the class blues and its a posteriori probabilities: 0.4097450 for the class blues; 0.0079218 for the class bossa nova; 0.5501617 for the class reggae; and 0.0321714 for the class rock. After sorting, the probability vector is [0.5501617, 0.4097450, 0.0321714, 0.0079218]. The smallest distance is now to the two-label vector, indicating that this sample should be classified into two classes, namely reggae and blues. Samples with similar a posteriori probabilities reflect a mixture of features from more than one class, and the motivation for this approach is to define a criterion by which they can be suitably labelled depending on this degree of similarity. For the unsupervised case, we adopted a different approach. From the Last.fm site (http://www.lastfm.com.br), a well-known interactive web radio, we verified how genre tags were associated with our samples in order to set up a multi-label dataset. We entered each sample into the site and obtained its most used tags (the ones that come together with the music).
If any of the four genres used in this paper (blues, bossa nova, reggae and rock) appeared among these tags, the sample was labelled with it. For example, among the most used tags for the music A Mess of Blues by Elvis Presley were the tags blues (the original one) and rock; therefore, this sample was labelled as blues and rock. This process resulted in the Venn diagram presented in figure 4.
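The supervised multi-label decision rule can be sketched as follows. The prototype vectors used here (uniform probability mass over one, two, three or four labels) are an assumption consistent with the worked example in the text, not a definition taken from it.

```python
import numpy as np

# Assumed reference vectors: uniform mass over k labels, k = 1..4
prototypes = {
    1: np.array([1.0, 0.0, 0.0, 0.0]),
    2: np.array([0.5, 0.5, 0.0, 0.0]),
    3: np.array([1/3, 1/3, 1/3, 0.0]),
    4: np.array([0.25, 0.25, 0.25, 0.25]),
}

def number_of_labels(posteriors):
    """posteriors: normalized a posteriori probabilities, one per class.
    Returns the label count whose prototype vector is nearest."""
    p = np.sort(posteriors)[::-1]   # sort in decreasing order
    dists = {k: np.linalg.norm(p - v) for k, v in prototypes.items()}
    return min(dists, key=dists.get)

# Sample 5 of class blues from the text: nearest to the two-label vector
p = np.array([0.4097450, 0.0079218, 0.5501617, 0.0321714])
print(number_of_labels(p))  # 2
```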
It is interesting to note that bossa nova samples did not have any of the other tags (blues, reggae or rock) associated with them. Twenty blues samples had a rock tag associated with them.

Performance measures for classification.
To objectively evaluate the performance of the supervised classification, it is necessary to use quantitative criteria. The most commonly used criteria are the estimated classification error and the obtained accuracy. Because of its good statistical properties, such as being asymptotically normal with well-defined expressions for estimating its variance, this study also adopted the Cohen kappa coefficient [39] as a quantitative measure of the performance of the proposed method. Besides, the kappa coefficient can be directly obtained from the confusion matrix [40] (appendix D), which is easily computed in supervised classification problems. The confusion matrix is defined so that each element c_ij represents the number of objects from class i classified as class j. Therefore, the elements on the diagonal indicate the number of correct classifications.
We computed the performance of the multi-label and single-label unsupervised classifications according to the following criterion: Performance = n_i/N_i, where n_i is the number of correct classifications for class i and N_i is the number of samples of class i.
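Cohen's kappa and the per-class performance can both be computed directly from the confusion matrix. The matrix values below are illustrative, not results from the paper.

```python
import numpy as np

# Hypothetical 4-class confusion matrix: C[i, j] = objects of class i
# classified as class j (rows sum to 70, as in a 4x70-sample setting).
C = np.array([[50,  5, 10,  5],
              [ 4, 55,  8,  3],
              [ 2,  3, 60,  5],
              [ 6,  4,  9, 51]])

N = C.sum()
p_o = np.trace(C) / N                                # observed agreement
p_e = (C.sum(axis=1) * C.sum(axis=0)).sum() / N**2   # chance agreement
kappa = (p_o - p_e) / (1 - p_e)

# Per-class performance n_i / N_i used for the unsupervised evaluation
performance = np.diag(C) / C.sum(axis=1)

print(round(kappa, 3))
print(performance)
```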

Results and discussion
As mentioned earlier, four musical genres were used in this study: blues, bossa nova, reggae and rock. We selected music art works from diverse artists, as presented in tables 2 and 3. Different colours were chosen to represent the genres (red for blues, green for bossa nova, cyan for reggae and pink for rock) in order to provide a better visualization and discussion of the results.

Single classification results
If we want to reduce data dimensionality, it is necessary to set a suitable number of principal components to represent the new features. Not surprisingly, in a high-dimensional space the classes can be easily separated. On the other hand, high dimensionality increases complexity, making the analysis of both the extracted features and the classification results difficult. One approach to obtaining a suitable number of principal components is to verify how much of the data variance is preserved. To do so, the first l eigenvalues (where l is the number of principal components under consideration) are summed and the result is divided by the sum of all the eigenvalues. If the result is equal to or greater than 0.75, this number of components (or new features) preserves at least 75% of the data variance, which is often enough for classification purposes. When PCA was applied to the normalized rhythm features, it was observed that 15 principal components preserved 76% of the variance of the data. That is, it is possible to reduce the data dimensionality from 324-D to 15-D without a significant loss of information. Nevertheless, as will be shown in the following, depending on the classifier and how the classification task was performed, different numbers of components were required in each situation in order to achieve suitable results. Although preserving only 36% of the variance, figure 5 shows the first three principal components, that is, the first three new features obtained with PCA. Figure 5(a) shows the first and second components.
In all the following supervised classification tasks, re-substitution means that all objects from each class were used as the training set (in order to estimate the parameters) and all objects were used as the testing set. Hold-out 70-30% means that 70% of the objects from each class were used as the training set and 30% (different ones) for testing. Finally, in hold-out 50-50%, the objects were separated into two groups: 50% for training and 50% for testing.
The kappa variance is strongly related to its accuracy, that is, to how reliable its values are: the higher the variance, the lower the accuracy. Since the kappa coefficient is a statistic, the use of large datasets generally improves its accuracy by making its variance smaller. This can be observed in the results. The smallest variance occurred in the re-substitution situation, in which all samples constitute the testing set, indicating that re-substitution provided the best kappa accuracy in the experiments. On the other hand, hold-out 70-30% produced the highest kappa variance, since only 30% of the samples constitute the testing set.
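The cumulative-variance criterion used above for choosing the number of principal components can be sketched as follows, with hypothetical eigenvalues:

```python
import numpy as np

# Hypothetical PCA eigenvalues in decreasing order (they sum to 100)
eigenvalues = np.array([40.0, 25.0, 12.0, 8.0, 6.0, 5.0, 4.0])

# Fraction of the total variance preserved by the first l components
ratio = np.cumsum(eigenvalues) / eigenvalues.sum()

# Smallest l preserving at least 75% of the variance
l = int(np.argmax(ratio >= 0.75)) + 1
print(l, ratio[l - 1])  # 3 components preserve 77% here
```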
The results obtained by the quadratic Bayesian classifier using PCA are shown in table 4 in terms of kappa, its variance, the accuracy of the classification and the overall performance according to the value of kappa. Table 4 also indicates that the performance was not satisfactory for the hold-out 70-30% and hold-out 50-50%. As PCA is not a supervised approach, the parameter estimation performance (of covariance matrices for instance) is strongly degraded because of the small sample size problem.
The confusion matrix for the re-substitution classification task in table 4 is illustrated in table 5. All reggae samples were classified correctly. In addition, many samples from the other classes were classified as reggae.
For comparing the two different classifiers, the results obtained by the linear Bayesian classifier using PCA are shown in table 6, again in terms of kappa, its variance, the accuracy of classification and the overall performance according to its value. The performance of the re-substitution classification task increased slightly, mainly because here a unique covariance matrix is estimated using all the samples in the dataset. Figure 6 depicts the value of kappa as a function of the number of principal components used in the quadratic and linear Bayesian classifiers. The last value in each plot makes it clear that, from that point onwards, the classification cannot be performed due to singularity problems involving the inversion of the covariance matrices (curse of dimensionality). It can be observed that this singularity threshold is different in each situation. For the quadratic classifier, this value is in the range of about 5-37 components whereas, for the linear classifier, it is in the range of about 9-106 components. The smaller quantity for the quadratic classifier can be explained by the fact that there are four covariance matrices, each estimated from the samples of one class. As there are 70 samples in each class, singularity problems occur in a smaller-dimensional space than for the linear classifier, which uses all 280 samples to estimate a single covariance matrix. Therefore, the ideal numbers of principal components, yielding the highest values of kappa, are those circled in red in figure 6.
Keeping in mind that the problem of automatic genre classification is a nontrivial task, that in this study only one aspect of the rhythm has been analysed (the occurrence of rhythm notations) and that PCA is an unsupervised approach for feature extraction, the correct classifications presented in tables 4 and 6 for the re-substitution situation corroborate strongly the viability of the proposed methodology. Despite the complexity of comparing different proposed approaches to automatic genre classification discussed in the introduction, these accuracy values are very similar or even superior compared to previous works [3], [10]- [12], [15].
Similarly, figure 7 shows the three components, namely the three new features obtained with LDA. As mentioned before, the LDA approach has the restriction of producing only C − 1 nonzero eigenvalues, where C is the number of classes; therefore, only three components are computed. Since LDA is a supervised approach whose main goal is to maximize class separability, the four classes in figures 7(a) and (b) are better separated than with PCA, although substantial overlaps remain. This result corroborates that automatic classification of musical genres is not a trivial task. Table 7 presents the results obtained by the quadratic Bayesian classifier using LDA. In contrast to PCA, the use of hold-out (70-30%) and hold-out (50-50%) provided good results, which is notable and reflects the supervised characteristic of LDA, which makes use of all the discriminant information available in the feature matrix.
Although the values of kappa and its variance obtained using LDA with re-substitution and PCA with re-substitution are similar, the two confusion matrices differ slightly. However, in both cases, the misclassified art works are concentrated in one class, represented by the genre reggae. The results obtained with the LDA technique are particularly promising because they reflect the nature of the data. Although widely used, terms such as rock, reggae or pop often remain loosely defined [3]. Yet, it is worth remembering that the intensity of the beat, which is a very important aspect of the rhythm, has not been considered in this work. This means that analysing rhythm only through notations, as currently proposed, could pose difficulties even for human experts. The misclassified art works have similar properties in terms of rhythm notations and, as a result, they generate similar weight matrices. Therefore, the proposed methodology, although requiring some complementation, seems to be a significant contribution toward the development of a viable alternative approach to automatic genre classification.
The results for the linear Bayesian classifier using LDA are shown in table 9. In fact, they are closely similar to those obtained by the quadratic Bayesian classifier (table 7).
As mentioned in appendix A.2, LDA also allows us to quantify the intra- and interclass dispersion of the feature matrix through functionals such as the trace and determinant computed from the scatter matrices [41]. The overall intraclass scatter matrix, denoted S_intra; the intraclass scatter matrices for each class, denoted S_intraBlues, S_intraBossaNova, S_intraReggae and S_intraRock; the interclass scatter matrix, denoted S_inter; and the overall separability index, denoted (S_intra^-1 S_inter), were analysed through their traces. Two important observations are worth mentioning. Firstly, these traces emphasize the difficulty of this classification problem: the traces of the intraclass scatter matrices are high, whereas the trace of the interclass scatter matrix, together with the overall separability index, is small. This confirms that the four classes overlap substantially. Secondly, the smallest intraclass trace is related to the genre reggae (it is the most compact class). This may explain why, in the experiments, art works belonging to reggae were correctly classified more frequently (90-100%).
The PCA and LDA approaches also help identify the features that contribute the most to the classification. This analysis can be performed by verifying the magnitude of each element in the first eigenvectors and then associating those elements with the original features. In the current study, we found that the ten sequences of rhythm notations contributing the most to separation correspond to those illustrated in figure 8. For the first and second eigenvectors obtained by PCA and LDA, the ten elements with the highest values were selected, and the indices of these elements were associated with the sequences in the original weight matrix. Figures 8(a) and (b) show the resulting sequences according to the first and second eigenvectors of PCA. The thickness of each edge is set by the value of the corresponding element in the eigenvector. Interestingly, these sequences are the ones that occur most frequently in the rhythms of all four genres studied here; that is, they correspond to the elements that play the greatest role in representing the rhythms. Therefore, they can be regarded as the ten most representative sequences from the point of view of the first and second eigenvectors of PCA. Triples of eighth and sixteenth notes are particularly important in the genres blues and reggae. Similarly, figures 8(c) and (d) show the resulting sequences according to the first and second eigenvectors of LDA. In contrast to those obtained by PCA, these sequences are not common to all the rhythms but occur with distinct frequencies within each genre. Thus, they are referred to here as the ten most discriminative sequences with respect to the first and second eigenvectors of LDA.
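The mapping from eigenvector elements back to note-to-note sequences can be sketched as follows. The eigenvector here is hypothetical (random), standing in for the one obtained by PCA or LDA; each of its 324 elements corresponds to one position of the 18 × 18 weight matrix.

```python
import numpy as np

# Hypothetical first eigenvector: one element per flattened feature
rng = np.random.default_rng(3)
eigvec = rng.normal(size=18 * 18)

# Indices of the ten elements with the largest magnitudes
top = np.argsort(np.abs(eigvec))[::-1][:10]

# Map flat indices back to (row, column) positions of the weight matrix,
# i.e. to pairs of rhythm notations (row-major flattening assumed)
pairs = [divmod(int(idx), 18) for idx in top]
print(pairs)  # ten (i, j) notation pairs
```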
Clustering results are discussed in the following. The number of clusters was set to four in order to provide a fair comparison with the supervised classification results. The idea behind the confusion matrix in table 10 is to verify how many art works from each class were placed in each of the four clusters. For example, art works 1 to 70 are known to belong to the genre blues. The first line of this confusion matrix then indicates that 25 blues art works were placed in cluster 1, 11 in cluster 2, 12 in cluster 3 and 22 in cluster 4. It can also be observed that in cluster 1, reggae art works are the majority (46), whereas in cluster 2, the majority are bossa nova art works (16), despite the small difference compared to the number of blues art works (11); in cluster 3, the majority are rock art works; and in cluster 4, the majority are blues art works.
Comparing the confusion matrix in table 10 with the confusion matrix for the quadratic Bayesian classifier using PCA in table 5, it is interesting to notice that: in the former, cluster 1 contains a considerable number of art works from all four genres (25 from blues, 32 from bossa nova, 46 from reggae and 30 from rock), totalling 133 art works; in the latter, a considerable number of art works from blues (22), bossa nova (23) and rock (31) were misclassified as reggae, totalling 146 art works assigned to this class. This means that the PCA representation was not efficient in discriminating reggae from the other genres, while cluster 1 was the one that most intermixed art works from all classes. Figure 9 presents the dendrogram with the four identified clusters. Different colours were used for the sake of enhanced visual analysis. Cluster 1 is shown in cyan, cluster 2 in green, cluster 3 in pink and cluster 4 in red. These colours were based on the dominant class in each cluster. For example, cluster 1 is shown in cyan because reggae art works form the majority in this cluster.
The four clusters obtained are detailed in figures 10-13, in which the legends present the grouped art works from each cluster (blues art works are shown in red, bossa nova art works in green, reggae art works in cyan and rock art works in pink).
As a consequence of working in a higher-dimensional feature space, the agglomerative hierarchical clustering approach could better separate the data when compared to the PCA- and LDA-based approaches, which are applied over a projected version of the original measurements. We computed the performance of the unsupervised classification by using the criterion described in section 2.5.1 and assuming the dominant class in each cluster: reggae in cluster 1, bossa nova in cluster 2, rock in cluster 3 and blues in cluster 4. The performances were as follows: 46/70 ≈ 66% for reggae, 16/70 ≈ 23% for bossa nova, 19/70 ≈ 27% for rock and 22/70 ≈ 31% for blues.
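The dominant-class performance criterion above can be sketched as follows (our own helper, applied here to a synthetic confusion matrix since table 10 is only partially reproduced in the text):

```python
import numpy as np

def cluster_performance(conf):
    """Per-cluster hit rate under the dominant-class assumption: for each
    cluster (column), the count of its dominant genre divided by that
    genre's total number of art works (its row sum)."""
    conf = np.asarray(conf)
    rates = {}
    for j in range(conf.shape[1]):
        dom = int(np.argmax(conf[:, j]))   # dominant class in cluster j
        rates[j] = conf[dom, j] / conf[dom].sum()
    return rates
```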

Multi-label classification results
This section presents the classification performed in a multi-label manner. The principal motivation is to complement the single-label classification results, considering the discussions about the redundancies of musical genres. The methodology is described in section 2.5.
Despite the fact that the best results were obtained by using the linear Bayesian classifier over 106 components of PCA (table 6) and by using the linear Bayesian classifier over three components of LDA (table 9), we chose to present the multi-label results for supervised classification by using the quadratic Bayesian classifier over the first two components of LDA (with kappa = 0.48), since this is easier to illustrate graphically. Tables 11 and 12 present the obtained labels for each sample. In order to provide a suitable graphical visualization of these results, figure 14 shows the contour plots of the 2D class-conditional Gaussian densities together with the scatter plots of the dataset. The contour plots are obtained by defining equally spaced subintervals within the range [p_min, p_max], where p_min is the minimum probability value and p_max is the maximum probability value of the distribution. In our case, illustrated in figure 14, the contour plots are given by the central isolines of such intervals (the ones that divide the interval [p_min, p_max] into two halves). It is possible to observe that the labels of each sample are related to its position in the feature space. As an example, consider blues sample number 30. Although it is a blues sample, its features are more similar to those of the bossa nova samples; that is, its feature vector is located closer to the centre of the bossa nova conditional density than to the centre of the blues density. Therefore, this sample is expected to be classified as belonging to the bossa nova class. Samples 56 and 265 were classified as belonging to all four classes. None of the reggae samples was classified as reggae only, since the reggae and rock class-conditional densities are almost completely overlapping.
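A simple way to turn class-conditional Gaussian densities into multiple labels is sketched below. Note that this is an illustrative rule of our own (assign every class whose density at the sample is within a fraction of the best one), not necessarily the exact criterion of section 2.5:

```python
import numpy as np
from scipy.stats import multivariate_normal

def multilabel_gaussian(x, means, covs, rel_threshold=0.5):
    """Toy multi-label rule: a sample receives every label whose
    class-conditional Gaussian density at x is at least rel_threshold
    times the largest density. Returns the indices of the labels."""
    dens = np.array([multivariate_normal(m, c).pdf(x)
                     for m, c in zip(means, covs)])
    return np.flatnonzero(dens >= rel_threshold * dens.max())
```

A sample lying between two heavily overlapping densities (like the reggae/rock pair in figure 14) would receive both labels under such a rule.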
The performance of the multi-label unsupervised classification was computed considering the multi-label dataset presented in section 2.5. For example, instead of considering only the 19 rock samples in the rock cluster, we now also verify how many of the 17 reggae-rock samples and how many of the blues-rock samples are in the rock cluster, since they are now also referred to as rock samples. The new total of rock samples is 107. The performances are: bossa nova, 23%, since no blues, rock or reggae samples were also labelled as bossa nova; reggae, 66%, since no blues, bossa nova or rock samples were also labelled as reggae; rock, 26% ((19+2+7)/107), since two reggae-rock samples and seven blues-rock samples are in the rock cluster; and blues, 30% (22/73), since none of the three rock-blues samples is in the blues cluster.

Concluding remarks
Automatic music genre classification has become a fundamental topic in music research, since genres have been widely used to organize and describe music collections. They also reveal general identities of different cultures. However, music genres are not a clearly defined concept, so the development of a non-controversial taxonomy represents a challenging, nontrivial task.
Generally speaking, music genres summarize common characteristics of musical pieces. This is particularly interesting when used as a resource for automatic classification of pieces. In the current paper, we explored genre classification taking into account the music's temporal aspects, namely the rhythm. We considered pieces of four musical genres (blues, bossa nova, reggae and rock), which were extracted from MIDI files and modelled as networks. Each node corresponded to one rhythmic notation, and the links were defined by the sequence in which they occurred in time. The idea of using static nodes (nodes with fixed positions) is particularly interesting because it provides a primary visual identification of the differences and similarities between the rhythms of the four genres. A Markov model was built from the networks, and the dynamics and dependencies of the rhythmic notations were estimated, comprising the feature matrix of the data. Two different approaches for feature analysis were used (PCA and LDA), as well as two types of classification methods (Bayesian classifier and hierarchical clustering).
Using only the first two principal components of PCA, the different types of rhythms were not separable, although for the first and third axes we could observe some separation between three of the classes (blues, bossa nova and reggae), while only the rock samples overlapped the other classes. However, taking into account that 15 components were necessary to preserve 76% of the data variance, it is expected that only two or three dimensions would not be sufficient to allow suitable separability. Notably, the dimensionality of the problem is high; that is, the rhythms are very complex, and many dimensions (features) are necessary to separate them. This is one of the main findings of the current work. The LDA analysis yielded another finding, which supported the assumption that automatic rhythm classification is not a trivial task. The projections obtained by considering the first and second, and first and third, axes yielded better discrimination between the four classes than that obtained by PCA.
Unlike PCA and LDA, agglomerative hierarchical clustering works on the original dimensions of the data. The application of this methodology led to a substantially better discrimination, which provides strong evidence of the complexity of the problem studied here. The results are promising in the sense that each cluster is dominated by a different genre, showing the viability of the proposed approach. The use of a multi-label approach was interesting and particularly appropriate, since it reflects the intrinsic nature of the dataset. Musical genre classification is a nontrivial task even for music experts, since a song can often be assigned to more than a single genre. Automatic classification is considerably more complex, because there are many samples with similar feature vectors, defining large overlapping areas in the feature space, as observed in our study. In this context, multi-label classification plays a fundamental role, making it possible to generalize the genre taxonomy originally presented in the training set. With our proposed method, new sub-genres (for example, rock-blues) can arise from the original ones, and we indeed observed a significant improvement in the supervised classification performance. For the multi-label unsupervised approach, which does not take the data covariance structure into account, we did not observe significant changes in the classification performance. Furthermore, the labelling process that takes place on the LastFm website, through direct listener interaction, considers all music content (instruments, harmony, melody, pitch, voice and percussion, among others). Here, our focus is only on rhythm analysis, which in practical terms implies a drastic reduction of computational cost, since it is a substantially more compact representation. It is clear from our study that musical genres are very complex and present redundancies. Sometimes it is difficult even for an expert to distinguish them.
This difficulty becomes more critical when only the rhythm is taken into account.
There are several possibilities for future research implied by the reported investigation. First, it would be interesting to use more measurements extracted from rhythm, especially the intensity of the beats, as well as the distribution of instruments, which is poised to improve the classification results. Another promising area for further investigation regards the use of other classifiers, as well as the combination of results obtained from an ensemble of distinct classifiers. In addition, it would be promising to extend multi-label classification, a growing field of research in which non-disjoint samples can be associated with one or more labels [42]. Another interesting avenue of future work is the synthesis of rhythms: once the rhythmic networks are available, new rhythms with characteristics similar to those of a specific genre can be artificially generated.

A.1. PCA

Additional information about PCA and its relation to various interesting statistical and geometrical properties can be found in the pattern recognition literature, e.g. [6,7,36,43].
Consider a vector x with n elements representing some features or measurements of a sample. In the first step of PCA transform, this vector x is centered by subtracting its mean, so that x ← x − E{x}. Next, x is linearly transformed to a different vector y that contains m elements, m < n, removing the redundancy caused by the correlations. This is achieved by using a rotated orthogonal coordinate system in such a way that the elements in x are uncorrelated in the new coordinate system. At the same time, PCA maximizes the variances of the projections of x on the new coordinate axes (components). These variances of the components will differ in most applications. The axes associated with small dispersions (given by the respectively associated eigenvalues) can be discarded without losing too much information about the original data.
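The steps above (centring, decorrelation via an orthogonal rotation, and discarding low-variance axes) can be sketched as a minimal NumPy implementation (an illustrative sketch, not the paper's code):

```python
import numpy as np

def pca(X, m):
    """Minimal PCA as described above: centre the data, diagonalize the
    sample covariance matrix, and project onto the m axes of largest
    variance. Returns the projected data and the retained variances."""
    Xc = X - X.mean(axis=0)                    # subtract the mean E{x}
    cov = (Xc.T @ Xc) / (len(X) - 1)           # sample covariance matrix
    vals, vecs = np.linalg.eigh(cov)           # eigenvalues ascending
    order = np.argsort(vals)[::-1][:m]         # keep the m largest
    return Xc @ vecs[:, order], vals[order]
```

The discarded axes are those whose eigenvalues (variances) are small, so little information about the original data is lost.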

A.2. LDA
LDA can be considered a generalization of Fisher's linear discriminant function for the multivariate case [7,36]. It is a supervised approach that maximizes data separability, in terms of a similarity criterion based on scatter matrices. The basic idea is that objects belonging to the same class are as similar as possible and objects belonging to distinct classes are as different as possible. In other words, LDA looks for a new, projected, feature space that maximizes interclass distance while minimizing the intraclass distance. This result can be later used for linear classification, and it is also possible to reduce dimensionality before the classification task. The scatter matrix for each class indicates the dispersion of the feature vectors within the class. The intraclass scatter matrix is defined as the sum of the scatter matrices of all classes and expresses the combined dispersion in each class. The interclass scatter matrix quantifies how disperse the classes are, in terms of the position of their centroids.
It can be shown that the maximization criterion for class separability leads to a generalized eigenvalue problem [7,36]. Therefore, it is possible to compute the eigenvalues and eigenvectors of the matrix defined by (S_intra^-1 · S_inter), where S_intra is the intraclass scatter matrix and S_inter is the interclass scatter matrix. The m eigenvectors associated with the m largest eigenvalues of this matrix can be used to project the data. However, the rank of (S_intra^-1 · S_inter) is limited to C − 1, where C is the number of classes. As a consequence, there are at most C − 1 nonzero eigenvalues; that is, the number of new features is conditioned to the number of classes, m ≤ C − 1. Another issue is that, for high-dimensional problems, when the number of available training samples is smaller than the number of features, S_intra becomes singular, complicating the generalized eigenvalue solution.
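The eigenproblem above can be sketched as follows (an illustrative NumPy sketch; the pseudo-inverse is used as a simple guard against the singular-S_intra issue mentioned in the text):

```python
import numpy as np

def lda_axes(X, y, m=None):
    """Discriminant axes from the eigenvectors of S_intra^-1 @ S_inter.
    At most C-1 eigenvalues are nonzero, so at most C-1 axes are kept."""
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    n = X.shape[1]
    S_intra = np.zeros((n, n))
    S_inter = np.zeros((n, n))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        d = Xc - mc
        S_intra += d.T @ d
        diff = (mc - mean_all).reshape(-1, 1)
        S_inter += len(Xc) * (diff @ diff.T)
    vals, vecs = np.linalg.eig(np.linalg.pinv(S_intra) @ S_inter)
    order = np.argsort(-vals.real)
    m = m if m is not None else len(classes) - 1
    return vecs[:, order[:m]].real, vals.real[order[:m]]
```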
More information about LDA can be found in [7], [35]-[37].

B. Bayesian classifier

Under the Gaussian hypothesis, the parameters of each class ω_i are estimated by maximum likelihood from the training samples: the mean vector μ̂_i is the sample mean and Σ̂_i is the sample covariance matrix of class ω_i. Within this context, classification can be achieved with discriminant functions, g_i, assigning an observed pattern vector x to the class ω_j with the maximum discriminant function value. By using Bayes's rule, discarding the constant terms and using the estimated parameters above, a decision rule can be defined as follows: assign an object x to class ω_j if g_j(x) > g_i(x) for all i ≠ j, where the discriminant function g_i is calculated as

g_i(x) = ln p(ω_i) − (1/2) ln |Σ̂_i| − (1/2) (x − μ̂_i)^T Σ̂_i^{-1} (x − μ̂_i).

Classifying an object or pattern x on the basis of the values of g_i(x), i = 1, ..., C (where C is the number of classes), with estimated parameters defines a quadratic discriminant classifier, also known as a quadratic Bayesian classifier or quadratic Gaussian classifier [36].
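The quadratic Gaussian classifier can be sketched as below (a minimal illustrative implementation, assuming the per-class covariance matrices are invertible):

```python
import numpy as np

def quadratic_bayes(X_train, y_train):
    """Fit a quadratic Gaussian classifier: per-class ML estimates of the
    mean and covariance, plus priors from class frequencies. Returns a
    predict function that maximizes the discriminant
    g_i(x) = ln p(w_i) - 0.5 ln|S_i| - 0.5 (x-m_i)^T S_i^-1 (x-m_i)."""
    classes = np.unique(y_train)
    params = {}
    for c in classes:
        Xc = X_train[y_train == c]
        mu = Xc.mean(axis=0)
        cov = np.cov(Xc, rowvar=False)
        params[c] = (np.log(len(Xc) / len(X_train)),   # ln prior
                     mu,
                     np.linalg.inv(cov),               # S_i^-1
                     np.linalg.slogdet(cov)[1])        # ln |S_i|

    def predict(x):
        def g(c):
            logp, mu, icov, logdet = params[c]
            d = x - mu
            return logp - 0.5 * logdet - 0.5 * d @ icov @ d
        return max(classes, key=g)

    return predict
```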
The prior probability p(ω_i) can be simply estimated as

p̂(ω_i) = n_i / N,

where n_i is the number of samples of class ω_i and N is the total number of samples. In multivariate classification situations with different covariance matrices, problems may occur in the quadratic Bayesian classifier when any of the matrices Σ̂_i is singular. This usually happens when there are not enough data to obtain an efficient estimate of the covariance matrices Σ_i, i = 1, 2, ..., C. An alternative that minimizes this problem consists of estimating one unique covariance matrix over all classes, Σ̂ = Σ̂_1 = · · · = Σ̂_C. In this case, the discriminant function becomes linear in x and can be simplified to

g_i(x) = ln p(ω_i) − (1/2) (x − μ̂_i)^T Σ̂^{-1} (x − μ̂_i),

where Σ̂ is the covariance matrix common to all classes. The classification rule remains the same. This defines a linear discriminant classifier (also known as a linear Bayesian classifier or a linear Gaussian classifier) [36].

C. Agglomerative hierarchical clustering

To show how the objects are grouped, hierarchical clustering can be represented by a corresponding tree, called a dendrogram. Figure C.1 illustrates a dendrogram representing the results of hierarchical clustering for a problem with eight objects. The measure of similarity among clusters can be observed on the vertical axis. Different numbers of classes can be obtained by horizontally cutting the dendrogram at different values of similarity or distance. Hence, to perform hierarchical cluster analysis it is necessary to define three main parameters. The first regards how to quantify the similarity between every pair of objects in the dataset, that is, how to calculate the distance between the objects. The Euclidean distance, which is frequently used, is adopted in this work, but other possible choices include the city-block, chessboard and Mahalanobis distances. The second parameter is the linkage method, which establishes how to measure the distance between two sets. The linkage method is used to link pairs of objects that are similar and then to form the hierarchical cluster tree.
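Such an agglomerative clustering can be sketched with SciPy, using the Euclidean distance and Ward's linkage (the choices adopted in this work) and cutting the dendrogram into a fixed number of clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def ward_clusters(X, n_clusters):
    """Agglomerative hierarchical clustering with Euclidean distance and
    Ward's linkage; the merge history Z encodes the dendrogram, and
    fcluster cuts it into n_clusters groups."""
    Z = linkage(X, method='ward')
    return fcluster(Z, t=n_clusters, criterion='maxclust')
```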
Some of the most popular linkage methods are single and complete linkage, mean linkage and Ward's linkage [6], [44]-[46]. Ward's linkage uses the intraclass dispersion as a clustering criterion: pairs of objects are merged in such a way as to guarantee the smallest increase in the intraclass dispersion. This clustering approach has sometimes been identified as the best hierarchical method [47]-[49] and is used in this work. Indeed, it is particularly interesting to analyse the intraclass dispersion in an unsupervised classification procedure in order to identify characteristics that are common to, or different from, those of the supervised classification. The third parameter concerns the number of desired clusters, an issue that is directly related to where to cut the dendrogram, as illustrated by C in figure C.1.

D. Kappa coefficient

The kappa coefficient is computed from the confusion matrix as follows [40]:

k̂ = (N Σ_i x_ii − Σ_i x_i+ x_+i) / (N² − Σ_i x_i+ x_+i),

where x_i+ is the sum of the elements in line i, x_+i is the sum of the elements in column i, C is the number of classes (the confusion matrix is C × C) and N is the total number of objects. The kappa variance can be estimated as

var(k̂) = (1/N) [ θ_1(1 − θ_1)/(1 − θ_2)² + 2(1 − θ_1)(2θ_1θ_2 − θ_3)/(1 − θ_2)³ + (1 − θ_1)²(θ_4 − 4θ_2²)/(1 − θ_2)⁴ ],

with

θ_1 = (1/N) Σ_i x_ii,   θ_2 = (1/N²) Σ_i x_i+ x_+i,
θ_3 = (1/N²) Σ_i x_ii (x_i+ + x_+i),   θ_4 = (1/N³) Σ_i Σ_j x_ij (x_j+ + x_+i)².

This statistic indicates that, when k̂ ≈ 0, there is no agreement, and when k̂ = 1, the agreement is total. Some authors suggest interpretations according to the value obtained for the kappa coefficient. Table D.1 shows one possible interpretation, proposed in [50].
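The kappa coefficient itself (the standard chance-corrected agreement measure) can be sketched as:

```python
import numpy as np

def kappa(conf):
    """Kappa coefficient from a C x C confusion matrix:
    (N * sum_i x_ii - sum_i x_i+ x_+i) / (N^2 - sum_i x_i+ x_+i)."""
    conf = np.asarray(conf, dtype=float)
    N = conf.sum()
    diag = np.trace(conf)                                  # sum x_ii
    chance = (conf.sum(axis=1) * conf.sum(axis=0)).sum()   # sum x_i+ x_+i
    return (N * diag - chance) / (N * N - chance)
```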