Computer-Assisted Chord Detection Using Deep Learning and the YOLOv4 Neural Network Model

Music symbol recognition is an important part of Optical Music Recognition (OMR), and chord recognition is one of the central research topics in music information retrieval. It plays an important role in information processing, music structure analysis, and recommendation systems. To address the low chord recognition accuracy of existing OMR models, this article proposes a chord recognition method based on the YOLOv4 neural network model. First, the YOLOv4 network is trained on single-voice scores to obtain the best base model; scores containing chords are then trained through neural-network fine-tuning. The model was evaluated on a test set generated by MuseScore, and the experimental results show that it recognizes chords well. Note recognition accuracy is high, reaching a duration-value accuracy of 0.96, which is higher than the note recognition accuracy of other score recognition models.


Introduction
Optical Music Recognition (OMR) is an important way to digitize score images, with broad application prospects in computer music, digital music libraries, and computer-assisted music teaching. A typical OMR system includes two stages: (1) recognition of score music symbols, i.e., converting score images into a collection of graphic objects with musical semantics; and (2) semantic reconstruction of music symbols, in which all kinds of symbols are organically integrated according to the organizational rules of the score. In music scores, symbols can be divided into musical notes and various notation marks, so note recognition is the core of score recognition. However, notes are usually densely arranged, diverse in shape, and small in size, which makes them difficult to recognize.
The role of Optical Music Recognition (OMR) is to effectively convert the score image into score information [1][2][3][4] in an editable or playable form, such as MIDI (for playing) and MusicXML (for page layout). Pacha et al. [5] integrated all data sets in the OMR research field and built a classification model based on convolutional neural networks. This method is dedicated to training a general musical notation classifier, and the experimental results show that the accuracy of the classifier reaches 95%. Pacha et al. [6] also compared several deep learning models (namely, a two-stage Faster R-CNN detector, a one-stage RetinaNet detector, and U-Net) and determined their performance under a common evaluation standard; their experimental results can be used as a baseline for general detectors in music symbol detection tasks. Tuggener et al. [7] connected ResNet-101 with the RefineNet up-sampling network as a score feature extractor and combined it with a border detection method to identify various symbols. The test results on the DeepScores and MUSCIMA++ [8] data sets show that recognition of high-frequency music symbols is good, but recognition of non-high-frequency symbols is not ideal. Huang Zhiqing et al. [9] used Darknet-53 as the backbone network for note feature extraction and combined it with feature fusion technology to propose a note recognition model with a recognition accuracy of 0.87; however, the model is limited to single notes, and when there are chord notes in the score it suffers from missed recognitions. Van der Wel et al. [10] divide the score with the staff as the unit, encode it into a sequence vector through a convolutional neural network, and combine it with a recurrent neural network to identify the pitch and time value of the notes in the sequence. Their experimental accuracy on notes is 0.76; this method needs to divide the music score into rows before entering the model and cannot recognize dense symbols.
Deep learning-based note recognition has thus achieved preliminary results, but the above methods are aimed only at single-voice scores and cannot recognize dense music symbols, tuplets, etc. However, most modern music scores are polyphonic and contain a large number of chords. Therefore, this paper proposes a deep learning-based recognition model for printed scores that can accurately identify both single notes and chords.

Dataset
This paper uses DeepScoresV2 [11] as the training data set, an extended and improved version of the original DeepScores [12] dataset that specifically addresses these issues and makes the following contributions: (a) it adds 20 formerly absent classes, including symbols without fixed size or shape that are nonetheless fundamental to music notation, thereby increasing the list of detectable musical symbols by 23%; (b) it adds ground truth for oriented bounding boxes, thus enabling research into detectors with potentially much higher precision; and (c) it adds ground truth for further higher-level musical semantics, making the dataset valuable for tasks beyond pure music object detection downstream in OMR. We selected 20,000 sheet music pictures, of which 70% were used to train the model, 15% to validate it, and the remaining 15% to test it.
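As a concrete illustration, the 70/15/15 split can be sketched as below (a minimal sketch; the file-naming scheme is hypothetical, not part of the dataset):

```python
import random

def split_dataset(image_paths, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle and split a list of score-image paths into train/val/test.

    The fractions match the paper's 70/15/15 split of 20,000 images.
    """
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    n = len(paths)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = paths[:n_train]
    val = paths[n_train:n_train + n_val]
    test = paths[n_train + n_val:]
    return train, val, test

# Hypothetical file names; with 20,000 images this yields 14,000/3,000/3,000.
train, val, test = split_dataset([f"score_{i:05d}.png" for i in range(20000)])
```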

Data Augmentation
Since the score images in the DeepScoresV2 dataset are all captured under ideal conditions, a model trained on them alone lacks generalization ability. To make the trained model robust to low-quality score images in varied scenarios, this paper applies four data augmentation methods that simulate noise in different situations: Gaussian noise, Gaussian blur, color transformation, and elastic deformations such as rotation and stretching are added to the original score images. Figure 1(a) simulates the impact of different lighting conditions by changing the hue, brightness, and saturation of the original image; Figure 1(b) adds Gaussian noise to simulate low-quality printing or scanning; Figure 1(c) applies elastic transformations such as stretching and rotation to simulate the folding and distortion of the music score during scanning; and Figure 1(d) adds Gaussian blur to produce a partial color-fading effect, simulating the deterioration of a score stored for a long time.
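Three of these augmentations can be approximated with simple NumPy operations. The sketch below is illustrative only (the parameter values are assumptions, not the paper's settings), and a brightness change stands in for the full hue/saturation transformation:

```python
import numpy as np

def add_gaussian_noise(img, sigma=10.0, rng=None):
    """Simulate low-quality printing or scanning with additive Gaussian noise."""
    if rng is None:
        rng = np.random.default_rng(0)
    noisy = img.astype(np.float64) + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def gaussian_blur(img, sigma=1.0, radius=2):
    """Separable Gaussian blur to mimic faded, long-stored score pages."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    k /= k.sum()
    out = img.astype(np.float64)
    # Convolve each column, then each row, with the 1-D kernel.
    out = np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), 0, out)
    out = np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), 1, out)
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)

def jitter_brightness(img, factor=1.2):
    """Simulate different lighting conditions via a brightness change."""
    return np.clip(img.astype(np.float64) * factor, 0, 255).astype(np.uint8)
```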

Implementation Details
In a music score image, a chord appears as a cluster of dense note heads, that is, a large number of detection targets in a small area, which poses a huge challenge to the object detection algorithm. In order to identify small objects such as note heads, we use CSPDarknet53 as the backbone to enlarge the receptive field, and we also use the FPN network structure for multi-channel feature fusion.

Detection Model
This paper uses the CSPDarknet53 [13] backbone network combined with feature fusion technology (SPP [14], shown in Fig. 3) to detect chords. An intuitive diagram of the network is shown in Fig. 2; the overall flow of a score image through the network is as follows. (a) FPN exploits the pyramidal hierarchy of CNN features to generate feature pyramids with strong semantic information at all scales. The network uses a top-down structure with horizontal (lateral) connections to integrate shallow, high-resolution layers with deep, semantically rich layers; in this way, a feature pyramid with strong semantics at all scales can be built quickly from a single input image at a single scale. A complete sheet music picture is input directly into the FPN network to obtain multi-layer features. (b) High-level neurons respond mainly to an entire note, while other neurons respond more to the local texture of the note. The network enhances the localization capability of the whole feature hierarchy by propagating strong low-level responses, because strong responses to edges and instance parts are reliable beacons for accurately localizing an instance. Therefore, horizontal connections link low-level and high-level information so that the network can identify each note more accurately. (c) Regression is used to predict note coordinates. (d) A fully-connected layer is used to calculate the confidence of each note.
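The top-down pathway with lateral connections described above can be sketched numerically, with the 1x1 lateral convolution reduced to a per-pixel channel projection (shapes and weights here are illustrative, not the model's):

```python
import numpy as np

def upsample2x(feat):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return feat.repeat(2, axis=1).repeat(2, axis=2)

def fpn_top_down(features, lateral_weights):
    """Minimal FPN top-down pass: each deeper (semantically strong) map is
    upsampled and added to a 1x1-projected shallow (high-resolution) map.

    `features` is ordered shallow -> deep; `lateral_weights` holds one
    (C_out, C_in) matrix per level standing in for the 1x1 lateral conv.
    """
    # A 1x1 convolution is a per-pixel matrix multiply over channels.
    laterals = [np.einsum('oc,chw->ohw', w, f)
                for w, f in zip(lateral_weights, features)]
    outputs = [laterals[-1]]                 # start from the deepest level
    for lat in reversed(laterals[:-1]):
        outputs.append(lat + upsample2x(outputs[-1]))
    return outputs[::-1]                     # shallow -> deep again
```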

Fig. 3 A network structure with a spatial pyramid pooling layer

Loss function
The loss function of the model has three parts: the coordinate loss of the prediction box, the category loss of the prediction box, and the confidence loss of the prediction box. Compared with the YOLOv3 network model, YOLOv4 optimizes the position loss of the prediction box: instead of the plain IoU loss, it uses the CIoU loss [14], which considers, in addition to the overlap, the center distance and the aspect-ratio consistency of the boxes. Its loss function is given by formula (1):

L_CIoU = 1 − IoU + ρ²(b, b_gt)/c² + αv,  (1)

where ρ(b, b_gt) is the distance between the centers of the predicted box b and the ground-truth box b_gt, c is the diagonal length of the smallest box enclosing both, v measures the consistency of their aspect ratios, and α is a trade-off weight.
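For a single pair of axis-aligned boxes, the CIoU position loss can be sketched in plain Python (an illustration, not the training implementation):

```python
import math

def ciou_loss(box_p, box_g):
    """CIoU loss for two boxes given as (x1, y1, x2, y2)."""
    # Intersection and union for the IoU term.
    ix1, iy1 = max(box_p[0], box_g[0]), max(box_p[1], box_g[1])
    ix2, iy2 = min(box_p[2], box_g[2]), min(box_p[3], box_g[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_g = (box_g[2] - box_g[0]) * (box_g[3] - box_g[1])
    iou = inter / (area_p + area_g - inter)

    # Squared centre distance rho^2 and enclosing-box diagonal c^2.
    cpx, cpy = (box_p[0] + box_p[2]) / 2, (box_p[1] + box_p[3]) / 2
    cgx, cgy = (box_g[0] + box_g[2]) / 2, (box_g[1] + box_g[3]) / 2
    rho2 = (cpx - cgx) ** 2 + (cpy - cgy) ** 2
    ex1, ey1 = min(box_p[0], box_g[0]), min(box_p[1], box_g[1])
    ex2, ey2 = max(box_p[2], box_g[2]), max(box_p[3], box_g[3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2

    # Aspect-ratio consistency v and its trade-off weight alpha.
    wp, hp = box_p[2] - box_p[0], box_p[3] - box_p[1]
    wg, hg = box_g[2] - box_g[0], box_g[3] - box_g[1]
    v = (4 / math.pi ** 2) * (math.atan(wg / hg) - math.atan(wp / hp)) ** 2
    alpha = v / (1 - iou + v) if v > 0 else 0.0

    return 1 - iou + rho2 / c2 + alpha * v
```

For identical boxes every term vanishes and the loss is 0; shifting or reshaping the predicted box raises it.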

Evaluation Method
The evaluation criterion for chords is re-established as follows: a chord counts as correctly detected only if the score detection model detects every note head in it; if one or more note heads are missed, the model is deemed unable to detect that chord. Since a chord contains many note heads, detecting all of the dense note heads is undoubtedly a huge challenge, and the final chord accuracy will accordingly be low.
• detection: the chord is detected successfully (every note head found).
• miss detection: one or more note heads of the chord are undetected.
• lack detection: no note head of the chord is detected at all.
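The chord-level verdict under this criterion can be expressed compactly (a sketch; note heads are represented here by arbitrary identifiers, not actual detections):

```python
def classify_chord(gt_heads, detected_heads):
    """Chord-level verdict under the paper's criterion: a chord counts as
    detected only if every one of its ground-truth note heads was found.
    """
    found = sum(1 for h in gt_heads if h in detected_heads)
    if found == len(gt_heads):
        return "detection"        # every note head found
    if found > 0:
        return "miss detection"   # at least one head missed
    return "lack detection"       # no head of the chord found at all
```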

Network Training
The entire training process is completely end-to-end: the score image is input directly, the loss is computed through the model, and the model parameters are optimized against that loss. The model is trained with a stochastic gradient descent optimizer with a batch size of 32 and an initial learning rate of 0.001; the learning rate decays continuously and is halved every ten epochs. After about 40 epochs the model began to converge, and training took about 6 hours. Table 1 shows the chord recognition results: of a total of 23,251 chords, 12,442 can be fully recognized.
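The stated schedule, halving an initial rate of 0.001 every ten epochs, corresponds to a simple step decay:

```python
def learning_rate(epoch, base_lr=1e-3, step=10):
    """Step decay used in training: the rate is halved every `step` epochs."""
    return base_lr * 0.5 ** (epoch // step)
```

By epoch 40, where convergence is reported, the rate has fallen to one sixteenth of its initial value.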

Table 1. Chord Detection Results
From the chord detection results, it can be seen that the accuracy of this model is 53.5%, the miss detection rate is 44.9%, and the lack detection rate is 1.5%. Since the notes in a chord are extremely dense, the effective receptive field for each note is very small, so an accuracy of 53.5% is an acceptable result. For chord detection, more in-depth experiments are still needed to improve the detection effect.
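The reported accuracy follows directly from the counts stated in the text:

```python
total_chords = 23251      # chords in the test set
fully_detected = 12442    # chords with every note head found
accuracy = fully_detected / total_chords
print(f"{accuracy:.1%}")  # prints 53.5%
```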

Conclusion and Further Research
In this work, we proposed a complete printed score recognition model based on a deep convolutional neural network. The system only needs a complete score as input to output the categories of the symbols. In future work, we plan to use more advanced networks to improve chord detection.