Classification of Clavicle Fractures based on Multi-View Fusion

Clavicle fracture is a common shoulder injury. The clinical Allman classification divides clavicle fractures into middle, distal and proximal fractures. Different fracture types call for corresponding treatments and different healing standards. Clavicle fractures can be misdiagnosed or missed by doctors when the fracture line is blurred. To improve the diagnostic efficiency of clinicians and provide clearer treatment ideas, this paper establishes a two-stage clavicle-assisted diagnostic model. The first stage uses 3D U-Net to segment normal and fractured clavicles from shoulder CT in 3D, reaching a Dice coefficient of 0.9441, and then computes the two-dimensional image information entropy of the segmented 3D image to select the key layers of the clavicle for classification. The second stage performs classification by fusing the key-layer data from the three views. The experimental results show that three-view fusion achieves higher classification accuracy than single-view slices, improving accuracy by 1.3% over the best coronal-plane result to 93.4%, which demonstrates that the two-stage method classifies well and can help doctors improve diagnostic efficiency.


INTRODUCTION
Clavicle fracture is a common injury in clinical practice, accounting for approximately 5% to 10% of all body fractures and 35% of shoulder injuries [1]. Clavicle fractures are classified according to the clinical Allman classification [2] into middle, distal and proximal clavicle fractures. The classification of a clavicle fracture reveals the force on the limb at the time of trauma and the stability of the fracture after reduction. It also allows physicians to observe and analyze the results of treatment, providing a basis for evaluating new treatments. Accurate classification is therefore important for achieving better treatment outcomes and better healing of the clavicle. For clinicians, reviewing a large number of images not only increases the workload, but fatigue and subjectivity can also affect the results, and some medical images may be obscured or blurred, making a determination very difficult even for specialists and leading to misdiagnosis and missed diagnosis. With the development of artificial intelligence, researchers have introduced deep learning into the field of orthopedics [3]. In recent years, deep learning for fracture detection in the human knee, wrist, shoulder and spine has achieved good results and has gradually begun to provide clinicians with diagnostic aids that speed up diagnosis.
Fracture detection is currently based on X-ray or CT images, and most studies classify the slice data directly or reconstruct the slices in 3D before classification [4]. CT images are body-layer images of a certain thickness for a given region, and converting them to slice data for feature extraction or reconstruction results in information loss. However, if the 3D CT images are classified directly, the effect is limited by the size of the dataset and the number of parameters is huge. To solve these problems, this paper proposes a two-stage clavicle fracture detection method. In the first stage, 3D segmentation of the shoulder CT extracts the fracture region, and the key layers of the clavicle are then selected by calculating the two-dimensional information entropy of the image; these serve as the input of the second-stage classification model. A multi-view fusion classification method is then proposed to fuse the information of CT slices from three views, the coronal, sagittal and transverse planes, which improves the classification accuracy and provides auxiliary diagnosis for doctors.

Fracture detection
Fracture detection can greatly ease the workload of clinicians and is a valuable aid when medical images are ambiguous and fracture types are difficult to determine. It covers common clinical fracture sites such as rib, shoulder, femur and wrist fractures. Olczak et al. [5] selected hand, wrist and ankle X-rays, evaluated the fractures with five deep learning networks and compared the diagnoses with two senior orthopedic surgeons, showing that human-level performance could be achieved; however, the study fed only one slice per case to the network, which could lead to detection errors due to omitted information. Chung et al. [6] used ResNet for the detection and assessment of proximal humeral fractures on X-rays and achieved results similar to those of orthopedic surgeons specializing in the shoulder joint, but again only one image per sequence was given to the network. Based on CT images, Kim et al. [7] selected a subset of slices from each case's sequence for training; the results showed that transfer learning with deep convolutional neural networks (CNNs) can be used for automatic wrist fracture detection and that large sample sizes reflect results more accurately, but the experiment only classified whether a fracture occurred. Cheng et al. [8] constructed a multi-class detection and localization model for hip fractures based on X-ray films; the model reached a classification sensitivity of 98%, and its effectiveness was verified with the visualization algorithm Gradient-weighted Class Activation Mapping (Grad-CAM).

Segmentation model
U-Net [9] was originally proposed to solve the medical image segmentation problem. Its U-shaped structure consists of an encoder for feature extraction and down-sampling, and a decoder that up-samples to produce the final segmentation result. Skip connections fuse deep and shallow features to reduce information loss. In general, all features of medical images are important, so the feature concatenation in U-Net's skip connections extracts feature information well. Moreover, medical image datasets are generally small, and large networks easily overfit when training on medical images with relatively uniform semantic content, so the small U-Net framework has an advantage. 3D U-Net [10] extends the 2D model to 3D; the biggest difference is that the original 2D convolutions are replaced by 3D convolutions suitable for volumetric images.
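The encoder, decoder and skip-connection idea can be sketched in a few lines of NumPy. This is purely illustrative: convolutions are omitted, and the `down`/`up` helpers and toy shapes are ours, not the paper's implementation.

```python
import numpy as np

def down(x):
    """Encoder step: 2x average pooling on a (C, D, H, W) volume."""
    C, D, H, W = x.shape
    return x.reshape(C, D // 2, 2, H // 2, 2, W // 2, 2).mean(axis=(2, 4, 6))

def up(x):
    """Decoder step: 2x nearest-neighbour upsampling back to full resolution."""
    return x.repeat(2, axis=1).repeat(2, axis=2).repeat(2, axis=3)

x = np.random.rand(4, 16, 16, 16)        # (channels, depth, height, width)
enc = down(x)                            # encoder halves each spatial dimension
dec = up(enc)                            # decoder restores the resolution
skip = np.concatenate([x, dec], axis=0)  # skip connection: concat shallow + deep features
print(skip.shape)                        # (8, 16, 16, 16)
```

The channel concatenation in the last line is what distinguishes U-Net's skip connections from the element-wise addition used in residual networks.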

Swin Transformer
After the great success of the Transformer [11] in Natural Language Processing (NLP), more and more research has applied Transformers to Computer Vision (CV). However, while tokens in NLP have a fixed scale, visual entities vary greatly in scale and images have far higher resolution than text. To solve this problem, the Swin Transformer [12] introduces two key concepts: Patch Merging and Shifted Windows. The Swin Transformer builds hierarchical feature maps by merging image patches in deeper layers, and since self-attention is computed only within each local window, the computational complexity scales linearly with the input image size. Its core structure is a stack of Swin Transformer blocks, which compute local window attention and then shift the windows so that information is exchanged across window boundaries while the resolution of the feature map is preserved. Patch Merging layers between stages reduce the resolution and increase the number of channels of the feature map, which lowers computational cost and improves model performance. Finally, the classification result is output via global average pooling and a fully connected layer.
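A minimal NumPy sketch of the window-partitioning step that gives Swin its linear complexity: attention is computed per window, so cost grows with the number of windows rather than quadratically with image size. The function name is ours; the shapes follow the Swin-T defaults (56×56 first-stage resolution, 7×7 windows, 96 channels).

```python
import numpy as np

def window_partition(x, win):
    """Split an (H, W, C) feature map into non-overlapping (win, win, C) windows.

    Self-attention is then computed inside each window independently, so the
    total cost is linear in the number of windows, i.e. in the image size.
    """
    H, W, C = x.shape
    assert H % win == 0 and W % win == 0, "feature map must be divisible by window size"
    x = x.reshape(H // win, win, W // win, win, C)
    # regroup into (num_windows, win, win, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, win, win, C)

fmap = np.random.rand(56, 56, 96)    # Swin-T stage-1 feature map
windows = window_partition(fmap, 7)  # Swin's default 7x7 windows
print(windows.shape)                 # (64, 7, 7, 96)
```

The shifted-window variant simply rolls the feature map by half a window before partitioning, so successive blocks mix information across window boundaries.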

Channel attention mechanism
Channel attention [13] focuses on which features are meaningful. It assigns different weights to different channels of the input feature map, so that channels with higher weights receive more attention. In deep learning, the input data are often high-dimensional and the correlations between dimensions differ. By adaptively re-weighting the feature maps of different channels during feature extraction, the channel attention mechanism fully considers the interrelationships between channels, enhancing the expressive power of the feature map and improving the accuracy and generalization ability of the model at the cost of only a small number of additional parameters.
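A squeeze-and-excitation style channel attention step can be sketched as follows. The function name and shapes are ours, and the random matrices stand in for weights that would be learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

def channel_attention(x, w1, w2):
    """SE-style channel attention on a (C, H, W) feature map."""
    squeeze = x.mean(axis=(1, 2))                    # global average pool -> (C,)
    hidden = np.maximum(0.0, w1 @ squeeze)           # FC to reduced dim + ReLU
    weights = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # FC back + sigmoid -> (C,) in (0, 1)
    return x * weights[:, None, None], weights       # re-weight each channel

C, H, W, r = 16, 8, 8, 4
w1 = rng.normal(size=(C // r, C))   # illustrative random weights (learned in practice)
w2 = rng.normal(size=(C, C // r))
x = rng.random((C, H, W))
y, w = channel_attention(x, w1, w2)
```

The bottleneck dimension `C // r` is what keeps the added parameter count small relative to the backbone.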

Framework
Because proximal clavicle fractures are very rare and too few CT images could be obtained for inclusion in the study, this paper divides the clavicle into no fracture, middle clavicle fracture and distal clavicle fracture based on the Allman classification. Firstly, the shoulder CT images are pre-processed and segmented in 3D to extract the clavicle region; then the key layers are selected by calculating the image information entropy of the segmented image and converted into slice data; finally, using the selected key layers, the data from the three views (coronal, sagittal and transverse planes) are fused through an attention mechanism, and a three-class clavicle classification model is established based on the Swin Transformer. The process of the clavicle fracture classification model is shown in Figure 1.

Clavicle segmentation based on 3D U-Net
When a clavicle fracture occurs, the fracture fragment is sometimes small and the fracture cannot be seen from a single view, so it must be examined from multiple angles. 3D segmentation makes better use of the global information of the CT image and yields more accurate segmentation. In this paper, 3D U-Net, which is well suited to medical tasks, is chosen to extract the clavicle region.

Image information entropy
Entropy [14] is a mathematical measure of uncertain information, used to represent the uncertainty of an information source. The information entropy of a CT image represents the amount of information it contains and can be used to judge the richness of the image. The two-dimensional information entropy gives a joint uncertainty measure over pixel values in the image. The mean gray value of a pixel's neighborhood is selected as the spatial feature of the gray distribution and combined with the gray value of the pixel itself to form the pair $(i, j)$, as shown in equation (1).

$$H = -\sum_{i=0}^{255}\sum_{j=0}^{255} p_{ij}\,\log_2 p_{ij} \qquad (1)$$
where $i$ denotes the gray value of the pixel, $j$ denotes the mean gray value of its neighborhood, and $p_{ij}$ is the probability of occurrence of the pair $(i, j)$. In this paper, we use a sliding window and calculate the information entropy within the window to determine the selection of clavicle image slices. We take the width and height of the segmented 3D clavicle image as the window size, i.e. 112×112, and slide along the depth dimension of the 3D image, computing the 2D information entropy within each window.
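A possible NumPy implementation of this two-dimensional entropy. The 3×3 neighborhood size is our assumption, since the paper does not specify the neighborhood used.

```python
import numpy as np

def entropy_2d(img, levels=256):
    """Two-dimensional information entropy of a grayscale image.

    Each pixel contributes the pair (i, j): its own gray level i and the mean
    gray level j of its 3x3 neighbourhood (assumed size).  p_ij is the relative
    frequency of each pair, and H = sum p_ij * log2(1/p_ij) = -sum p_ij log2 p_ij.
    """
    H, W = img.shape
    pad = np.pad(img.astype(np.float64), 1, mode="edge")
    # 3x3 neighbourhood mean via nine shifted copies
    neigh = sum(pad[dy:dy + H, dx:dx + W] for dy in range(3) for dx in range(3)) / 9.0
    pairs = img.astype(int) * levels + neigh.astype(int)   # encode (i, j) as one index
    counts = np.bincount(pairs.ravel(), minlength=levels * levels)
    p = counts[counts > 0] / (H * W)
    return float(np.sum(p * np.log2(1.0 / p)))

flat = np.zeros((112, 112), dtype=np.uint8)
print(entropy_2d(flat))   # 0.0 -- a constant image carries no information
```

A slice that contains only background therefore scores near zero and is rejected by the key-layer threshold, while slices crossing the clavicle score higher.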

Multi-view fusion classification
To use more fracture information and improve classification accuracy, we fuse the slice data from the coronal, sagittal and transverse views according to the weight of information carried by each plane to obtain richer features, and then classify with the Swin Transformer. Firstly, the slice data of all views are unified into single-channel grayscale images, and the different views of the same sample are concatenated along the channel dimension. In this paper there are 181 shoulder CT cases, and 40 layers are selected from each CT after key-layer selection. With the channel attention mechanism, a corresponding weight is computed for each view. Let $N$ be the number of samples and $M = 3$ the number of views, with $X_t$, $X_s$ and $X_c$ denoting the transverse, sagittal and coronal views respectively, each slice being of size $H \times W$. The $M$ views are first concatenated into an $M \times H \times W$ tensor $T$; global max pooling of $T$ yields a $1 \times 1 \times M$ tensor, which is passed through a ReLU activation and a fully connected layer to obtain another $1 \times 1 \times M$ tensor; Sigmoid normalization to the $[0, 1]$ interval then gives the final weights, forming the weight vectors $\{\{w_{11}, \ldots, w_{1M}\}, \ldots, \{w_{N1}, \ldots, w_{NM}\}\}$. Each weight is multiplied with the key layers of the corresponding view and the results are summed to obtain the final feature vector, on which classification is performed.
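The fusion step described above can be sketched as follows. The (3, 3) fully connected matrix is a random stand-in for weights learned during training, and the function name is ours.

```python
import numpy as np

rng = np.random.default_rng(42)

def fuse_views(views, w_fc):
    """Attention-weighted fusion of M view slices, following the text:
    concat -> global max pool -> ReLU -> FC -> sigmoid -> weighted sum."""
    t = np.stack(views, axis=0)                        # (M, H, W) tensor T
    pooled = t.max(axis=(1, 2))                        # global max pool -> (M,)
    hidden = np.maximum(0.0, pooled)                   # ReLU activation
    logits = w_fc @ hidden                             # fully connected layer -> (M,)
    weights = 1.0 / (1.0 + np.exp(-logits))            # sigmoid -> weights in (0, 1)
    fused = (weights[:, None, None] * t).sum(axis=0)   # weighted sum over views
    return fused, weights

H = W = 112
transverse, sagittal, coronal = (rng.random((H, W)) for _ in range(3))
w_fc = rng.normal(size=(3, 3))   # illustrative random weights; learned in practice
fused, weights = fuse_views([transverse, sagittal, coronal], w_fc)
```

Because the weights are learned per view, the network can down-weight the transverse plane, which the experiments below show to be the least informative on its own.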

Dataset
In this paper, 100 CT cases without clavicle fracture and 81 CT cases with clavicle fracture were collected, including 43 cases of distal clavicle fracture and 38 cases of middle clavicle fracture, all of size 112×112×112 with spacing (1 mm, 1 mm, 1 mm) in NIfTI format. After the first-stage segmentation and key-layer selection, 40 slices were selected per CT, resulting in 7,240 clavicle slices in PNG format: 4,000 slices without fracture and 3,240 fracture slices, of which 1,720 show distal and 1,520 show middle clavicle fractures.

Evaluation Metric
In this paper, we use the Dice coefficient to evaluate the clavicle segmentation algorithm. The Dice coefficient represents the similarity of two samples, with values in the interval [0, 1], as shown in equation (2).

$$\mathrm{Dice} = \frac{2\,|X \cap Y|}{|X| + |Y|} \qquad (2)$$
Here X denotes the ground-truth segmentation hand-labeled by experts and Y denotes the result obtained by model segmentation. We use ROC curves and AUC to evaluate the classification algorithms. Computing an ROC curve requires the confusion-matrix quantities TP, TN, FP and FN, which in this experiment must be calculated separately for each class because multiple classes are involved. For example, for distal clavicle fractures, TP means a distal fracture predicted as a distal fracture, TN means a normal or middle-fracture case predicted as normal or middle fracture, FP means a normal or middle-fracture case predicted as a distal fracture, and FN means a distal fracture predicted as normal or middle fracture. The horizontal coordinate of the ROC curve is the False Positive Rate (FPR), the proportion of negative samples predicted incorrectly, and the vertical coordinate is the True Positive Rate (TPR), the proportion of positive samples predicted correctly: FPR = FP / (FP + TN) and TPR = TP / (TP + FN).
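Both metrics are straightforward to compute. A sketch using the one-vs-rest confusion-matrix convention described above (the labels and toy arrays are illustrative):

```python
import numpy as np

def dice(x, y):
    """Dice = 2|X intersect Y| / (|X| + |Y|) for binary masks."""
    x, y = x.astype(bool), y.astype(bool)
    denom = x.sum() + y.sum()
    return 2.0 * np.logical_and(x, y).sum() / denom if denom else 1.0

def tpr_fpr(y_true, y_pred, positive):
    """One-vs-rest TPR and FPR for one class of a multi-class problem."""
    pos_true = y_true == positive
    pos_pred = y_pred == positive
    tp = np.sum(pos_true & pos_pred)
    fn = np.sum(pos_true & ~pos_pred)
    fp = np.sum(~pos_true & pos_pred)
    tn = np.sum(~pos_true & ~pos_pred)
    return tp / (tp + fn), fp / (fp + tn)

mask = np.zeros((4, 4), dtype=np.uint8)
mask[1:3, 1:3] = 1
print(dice(mask, mask))   # 1.0 -- identical masks overlap perfectly

# illustrative labels: 0 = no fracture, 1 = middle, 2 = distal
y_true = np.array([2, 2, 1, 0, 0, 2])
y_pred = np.array([2, 1, 1, 0, 2, 2])
tpr, fpr = tpr_fpr(y_true, y_pred, positive=2)   # TPR = 2/3, FPR = 1/3
```

Sweeping a decision threshold over the class scores and recording (FPR, TPR) pairs traces out the per-class ROC curve.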

Comparison of methods
In this experiment, we compared 3D U-Net with the traditional region growing algorithm [15]; the results in Table 1 show that 3D U-Net achieves the better segmentation, with a Dice of 0.9441. The first-stage 3D segmentation results are shown in Figure 2, where (a) is data without fracture and (b) is distal clavicle fracture data. Traditional region growing segments the unfractured clavicle well, but adhesions appear when it encounters small fracture lines, so it cannot accurately segment the fracture fragments, whereas deep-learning-based 3D segmentation identifies the fracture lines better. It also performs better on small fracture areas, such as the class (b) axial view shown in Figure 2, ensuring the accuracy of the second-stage input data.
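For intuition, a naive 3D region-growing baseline can be sketched as follows; it also illustrates why the method "leaks" across faint fracture lines whose intensity falls within the tolerance. The seed, tolerance and toy volume are illustrative, not the paper's configuration of [15].

```python
import numpy as np
from collections import deque

def region_grow_3d(vol, seed, tol):
    """Naive 3D region growing: BFS from a seed voxel, adding 6-connected
    neighbours whose intensity is within `tol` of the seed intensity."""
    mask = np.zeros(vol.shape, dtype=bool)
    seed_val = vol[seed]
    q = deque([seed])
    mask[seed] = True
    while q:
        z, y, x = q.popleft()
        for dz, dy, dx in ((1,0,0), (-1,0,0), (0,1,0), (0,-1,0), (0,0,1), (0,0,-1)):
            n = (z + dz, y + dy, x + dx)
            if all(0 <= n[i] < vol.shape[i] for i in range(3)) \
                    and not mask[n] and abs(vol[n] - seed_val) <= tol:
                mask[n] = True
                q.append(n)
    return mask

# toy volume: a bright 3x3x3 "bone" cube inside a dark background
vol = np.zeros((8, 8, 8))
vol[2:5, 2:5, 2:5] = 1.0
mask = region_grow_3d(vol, seed=(3, 3, 3), tol=0.5)
print(int(mask.sum()))   # 27 -- exactly the bright cube is captured
```

If a thin fracture line has intensity within `tol` of the bone, the BFS crosses it and fuses the fragments into one region, which is the adhesion failure observed in Table 1.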

Key layers selection
In this experiment, we set the sliding window size to 112×112 and performed key-layer selection separately for the no-fracture, middle-fracture and distal-fracture data, calculated the two-dimensional information entropy of each layer, and selected the 40 most informative layers. For slice data after the first-stage segmentation, slices with two-dimensional information entropy greater than 0.28 are selected; the results show that the key layers are concentrated in the range of slice layers 15-60, which we save in PNG format.
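The selection rule (entropy threshold 0.28, top 40 layers) can be sketched as follows; the synthetic entropy profile is illustrative.

```python
import numpy as np

def select_key_layers(entropies, k=40, threshold=0.28):
    """Pick the k most informative slices whose 2D entropy exceeds the threshold."""
    entropies = np.asarray(entropies)
    candidates = np.flatnonzero(entropies > threshold)
    # sort candidates by entropy, descending, and keep the top k slice indices
    order = candidates[np.argsort(entropies[candidates])[::-1][:k]]
    return np.sort(order)

rng = np.random.default_rng(7)
ent = rng.random(112) * 0.2   # mostly uninformative slices ...
ent[15:60] += 0.5             # ... with key layers concentrated in the 15-60 band
keys = select_key_layers(ent)
print(len(keys))              # 40 selected slices, all within the 15-60 band
```

Returning the indices in sorted order preserves the anatomical ordering of the slices for the second-stage classifier.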

Parameter setting
In this experiment, we set the batch size to 8 and the number of epochs to 80, using the Adam optimizer with a learning rate of 0.001 and binary cross-entropy as the loss function.

Analysis of results
Firstly, to verify the effectiveness of key-layer selection, we compare the proposed selection based on two-dimensional information entropy with random selection and clinician selection. The results in Table 2 show that selection based on two-dimensional information entropy achieves better classification accuracy than random selection. This is because some slices of the segmented clavicle CT contain no clavicle information, and training on them reduces classification accuracy. Our method is only 0.4% below clinician selection, achieving a similar effect, and since clinician selection requires considerable manpower and time, the proposed method provides a sound basis for the subsequent multi-view fusion.
Then, to examine how data from different views influence classification accuracy, we compare the same key slices and the corresponding original image layers in the coronal, sagittal and transverse planes, using the pre-trained Swin Transformer as the backbone network. The results are shown in Figure 3, where Y-Acc and Y-F1Score denote the accuracy and F1-score of the original image layers corresponding to the key layers, and S-Acc and S-F1Score denote those of the key layers after first-stage segmentation. Both the coronal and sagittal planes classify well, with the coronal plane performing best, while the transverse plane performs poorly, indicating that it cannot serve as the sole basis for diagnosing clavicle fractures. For the same view, the key layers selected after first-stage segmentation classify better than the corresponding original slices, demonstrating the necessity of the first-stage segmentation.
Finally, we fused the coronal, sagittal and transverse slices from the three views and, to verify the classification performance of the Swin Transformer, compared it with three classical classification networks: ResNet [16], MobileNet [17] and EfficientNet [18]. The results in Table 3 show that among the four networks, Swin Transformer performs best with an accuracy of 0.934, followed by ResNet, while MobileNet and EfficientNet perform poorly on this task. Compared with the coronal plane, the single view with the best classification effect, multi-view fusion improves accuracy by 1.3%, which proves the effectiveness of multi-view fusion. The corresponding ROC curves are shown in Figure 4. The multi-view fusion classification model achieves a good classification effect and is expected to help clinicians diagnose more efficiently.

Conclusions
Clavicle fracture is a common shoulder fracture, and different types of clavicle fractures have corresponding conservative or surgical treatments. To help doctors make better diagnoses and follow-up observations, and to address the information loss caused by using only two-dimensional single-view slice data in fracture detection, this paper implements a two-stage clavicle fracture diagnosis model. The first stage uses 3D U-Net to exploit the inter-slice information of CT for segmentation, providing doctors with a clearer display of the clavicle region. The second stage introduces an attention mechanism into the Swin Transformer to fuse the layers from the three views, achieving a better classification effect; this reduces misdiagnosis on difficult images, eliminates detail interference for clinicians, and yields higher diagnostic efficiency.
The proposed two-stage method achieved good performance on the three-class clavicle task, but this paper currently compares models only in terms of diagnostic accuracy; future research should compare the models' diagnoses with those of clinicians to obtain results convincing to clinicians. Furthermore, because the labeled shoulder dataset is small, future research could focus on semi-supervised or unsupervised methods to achieve better classification with less labeled data.

Figure 1. Process of the clavicle fracture classification model
Figure 3. Comparison of classification results from different views
Figure 4. ROC curves of multi-view fusion
Table 1. Comparison of the effects of different segmentation methods
Table 2. Comparison of different layer-selection methods
Table 3. Comparison of multi-view fusion results