Prior Knowledge-guided Hierarchical Action Quality Assessment with 3D Convolution and Attention Mechanism

Recently, there has been a growing interest in the field of computer vision and deep learning regarding a newly emerging problem known as action quality assessment (AQA). However, most researchers still rely on the traditional approach of using models from the video action recognition field. Unfortunately, this approach overlooks crucial features in AQA, such as movement fluency and degree of completion. Alternatively, some researchers have employed the transformer paradigm to capture action details and overall action integrity, but the high computational cost associated with transformers makes them impractical for real-time tasks. Due to the diversity of action types, it is challenging to rely solely on a shared model for quality assessment of various types of actions. To address these issues, we propose a novel network structure for AQA, which is the first to integrate multi-model capabilities through a classification model. Specifically, we utilize a pre-trained I3D model equipped with a self-attention block for classification. This allows us to evaluate various categories of actions using just one model. Furthermore, we introduce self-attention mechanisms and multi-head attention into the traditional convolutional neural network. By systematically replacing the last few layers of the conventional convolutional network, our model gains a greater ability to sense the global coordination of different actions. We have verified the effectiveness of our approach on the AQA-7 dataset. In comparison to other popular models, our model achieves satisfactory performance while maintaining a low computational cost.


Introduction
In recent years, there has been increasing interest among researchers in action quality assessment (AQA), driven by advances in motion recognition technology. AQA involves evaluating the quality of human actions and plays a crucial role in applications such as healthcare manoeuvre training [3] and sports skills evaluation [4].
Traditionally, researchers have relied on models borrowed from video action recognition to address AQA tasks. However, these models do not cater specifically to the unique requirements of AQA, such as movement fluency and the degree of completion. These factors are essential for accurate and comprehensive action quality assessment, so the direct adaptation of video action recognition models leaves room for improvement. Moreover, different action details require distinct model structures, which limits the versatility of such models and increases the time and resources needed for deployment.
Another approach that has been explored is leveraging the transformer paradigm [1], which can capture fine action details and overall action integrity. However, the computational cost associated with transformers poses a practical challenge for real-time AQA tasks. Additionally, state-of-the-art transformer-based models are limited in their applicability across different action categories, requiring separate training for each category.
Motivated by these challenges and limitations, we propose a novel network structure for AQA. Our model is the first to introduce a multi-model integration training method for AQA using a pre-classification model, and experimental results demonstrate satisfactory performance. Our approach integrates the self-attention mechanism [1] and multi-head attention [2] into traditional convolutional neural networks (CNNs), replacing the final layers of a standard network. This enhancement enables the model to perceive global coordination across different actions. We also introduce a pre-trained I3D model [5] equipped with a self-attention block [1] ahead of our model's main structure. This allows different action categories to be evaluated with a single pipeline, eliminating the need for separate training for each category.
Our proposed model structure offers several advantages: 1) State-of-the-art performance: the multi-attention mechanism enables the network to perceive contextual information in the time dimension while, to some extent, disregarding inconsequential spatial information. Our method outperforms other methods and achieves state-of-the-art performance on the AQA-7 dataset.
2) Efficiency: compared to commonly used transformer models, our attention-based convolutional method allows the model to focus selectively on important features, reducing computational cost. The introduction of the pre-classification and attention blocks also enables our model to perform well on different action categories, making it a general structure for AQA.
In this paper, we present the details of our proposed model, including the organization of its components and their purposes. We demonstrate the effectiveness of our model on the AQA-7 dataset. Compared to other prevalent models, our approach achieves satisfactory performance while maintaining significantly lower computational costs, making it a promising solution for real-time AQA tasks.

Self-Attention Mechanism
As the key component of the Transformer [1], the self-attention mechanism was first applied in the natural language processing field. Since then, many Transformer-based models have been published; BERT [9], released by Google, is the most famous among them. Due to its high parallel efficiency, low information transmission loss, and high information fusion efficiency, the Transformer has also been applied to image recognition. In 2020, Google proposed the Vision Transformer [10], which uses the Transformer for image classification [11]. In our setting, the core role of the self-attention mechanism is to extract salient features from images and videos, improving training precision while limiting the number of parameters. This goal is shared with the SE module; in fact, the SE module can be regarded as a special form of self-attention.

Squeeze-excitation
The squeeze-excitation (SE) module was first presented in SENet [6], the winning model of the ImageNet 2017 classification challenge. The SE attention module is a channel attention module that re-weights the channel features of the input feature map without changing its spatial size. As shown in Figure 1, its role is to improve the performance of the model by learning the relationships between different channels while retaining the original features. In a convolutional neural network, the SE module can dynamically adjust the weights of different channels to improve performance. In the original paper, inserting the SE module into the Inception [7] and ResNet [8] networks significantly improved their accuracy. Inspired by this work, we apply this attention mechanism when extracting action features from videos and combine it with C3D and I3D networks for the evaluation of action quality.
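For concreteness, a minimal PyTorch sketch of a squeeze-excitation block adapted to 3D video features is shown below, in the spirit of how we combine SE with C3D/I3D; the channel count and reduction ratio are illustrative rather than our exact configuration.

```python
import torch
import torch.nn as nn

class SEBlock3D(nn.Module):
    """Channel attention (squeeze-excitation) for 5D video features (N, C, T, H, W)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool3d(1)           # global average pool over T, H, W
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # W1: C -> C/r
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # W2: C/r -> C
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c = x.shape[:2]
        z = self.squeeze(x).view(n, c)                   # squeeze: one value per channel
        s = self.excite(z).view(n, c, 1, 1, 1)           # excitation: per-channel weights in (0, 1)
        return x * s                                     # rescale channels; spatial size unchanged

# Example: re-weight an I3D/C3D-style feature map without changing its shape.
feat = torch.randn(2, 256, 8, 14, 14)
out = SEBlock3D(256)(feat)
assert out.shape == feat.shape
```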

Multi-head attention method
The multi-head attention mechanism was first put forward in the Transformer [1]; it enables attention to multiple positions in the input sequence concurrently, thus capturing a diverse range of information. As shown in Figure 2, this mechanism transforms the input data into queries, keys, and values through transformations learned during model training. Attention scores between query-key pairs form a distribution over input positions, yielding a weighted sum of values as the output. The term 'multi-head' refers to the parallel application of this process, each head with its own learned transformations for queries, keys, and values, facilitating simultaneous focus on different aspects of the data. In our model, multi-head attention is adopted to provide a more global receptive field over the overall action features.
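For reference, this behaviour is available directly in PyTorch via nn.MultiheadAttention; the short self-attention usage sketch below uses illustrative tensor sizes, not our exact configuration.

```python
import torch
import torch.nn as nn

# 8 heads attending over a sequence of 16 positions with 512-dimensional features.
mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
x = torch.randn(4, 16, 512)                  # (batch, sequence length, feature dimension)
out, weights = mha(query=x, key=x, value=x)  # self-attention: Q, K, V all come from the same input
print(out.shape, weights.shape)              # torch.Size([4, 16, 512]) torch.Size([4, 16, 16])
```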

Video Action Classification
In most current work on video action classification, the base model is C3D [2]; in contrast, our primary model incorporates a Two-Stream Inflated 3D ConvNet (I3D) [5] together with several squeeze-excitation (SE) blocks to improve action recognition and assessment. As shown in Figures 3 and 4, the initial design with a single I3D network, an extension of the 2D Inception-v1 architecture [7] into 3D, fell short of our expectations. Incorporating SE blocks into specific layers significantly enhanced the model's performance, almost reaching our desired benchmark. Notably, in our I3D architecture the depth parameter requires independent training, distinct from the width and height parameters derived from Inception.

Action quality assessment
The majority of current action quality assessment (AQA) methods concentrate primarily on two areas: sports video analysis [4] and surgical manoeuvre assessment [3]. Because action recognition and action quality assessment focus on different aspects, traditional 3D convolutional networks such as C3D and I3D leave considerable room for improvement. Motivated by this need, we design a new structure for evaluating action quality. Inspired by the structure of ResNet-50 [8] in the image classification field, we adopt a new 50-layer neural network as our backbone. Unlike traditional methods that use max pooling, we use 1×1×1 convolutions for dimensionality reduction, and the last few convolutional layers are replaced with multi-head attention layers. Through our experiments, we found that multi-head attention is most effective at helping the model capture the overall features of actions when placed in the last few layers. The specific approach is explained in the following sections.

Overview
Our whole network structure is shown in Figure 1. An input action video with N frames is given to an I3D network, which classifies the input and produces its label. After obtaining the label, the model selector chooses the corresponding model for the input and pushes the video into the action quality assessment part, which combines three stages of multi-bottleneck layers. In this part, our model extracts the motion feature and sends it to the multi-head attention layer in order to gain global awareness of the motion feature. Then, processed by the fully connected layer, the network produces the predicted quality score of the input video as its output.
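The following pseudocode sketches this pipeline; the names classifier and assessors are placeholders standing in for the components of Figure 1, not our actual implementation.

```python
import torch.nn as nn

class AQAPipeline(nn.Module):
    """Sketch of the Figure 1 pipeline; module names are illustrative placeholders."""
    def __init__(self, classifier: nn.Module, assessors: dict):
        super().__init__()
        self.classifier = classifier                # pre-trained I3D + SE classification branch
        self.assessors = nn.ModuleDict(assessors)   # one assessment branch per action category

    def forward(self, video):
        # 1) Pre-classification: identify the action category of the input clip.
        label = self.classifier(video).argmax(dim=-1).item()
        # 2) Model selector: pick the category-specific assessment branch.
        assessor = self.assessors[str(label)]
        # 3) Quality assessment: multi-bottleneck stages, multi-head attention, MLP head.
        return assessor(video)
```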

Pre-classification stage
Given an input action video with $N$ frames $V = \{F_i\}_{i=1}^{N}$, a pre-trained classification model is used to identify the specific category of the action. After the feature extraction process conducted by the I3D model, an extension of Inception-v1 (Figure 2), we obtain the output feature map $f = \{y_i\}_{i=1}^{bt\_sz}$, $y_i \in \mathbb{R}^{W' \times H'}$. The SE attention mechanism [6] (Figure 1(a), Figure 3) is then adopted to achieve global spatial information aggregation.
Figure 2. A simple diagram showing the implementation mechanism of Inception-v1.

The SE attention mechanism is realized as follows. First is the squeeze stage: global information is compressed into a single value for each channel using average pooling, yielding a compact feature $z$,

$z_c = \frac{1}{W' \times H'} \sum_{i=1}^{W'} \sum_{j=1}^{H'} y_c(i, j),$

where the subscript $c$ indexes the channels.
Next is the excitation stage, in which the weight for each channel is calculated via fully connected layers:

$s = \sigma\big(g(z, W)\big) = \sigma\big(W_2\,\delta(W_1 z)\big),$

where $g(z, W)$ denotes the fully connected network, $\delta$ is the ReLU activation function, $\sigma$ is the sigmoid activation function, $W_1 \in \mathbb{R}^{C/r \times C}$ and $W_2 \in \mathbb{R}^{C \times C/r}$ are the weights of the fully connected layers, and $r$ is the reduction ratio, set to 16 in our project.
Finally, the feature maps are rescaled by multiplying each channel with its learned weight, $\tilde{y}_c = s_c \cdot y_c$, so each channel is re-weighted using globally pooled spatial information, effectively modeling the dependencies between channels.

Figure 3. The visualization of the squeeze-excitation mechanism.

After the SE attention mechanism, the processed feature map is sent to the MLP block, which outputs a one-dimensional vector $X = \{x_1, x_2, \dots, x_{class\_num}\}$ representing the probabilities of the different action classes. From this vector we obtain the class of the input video and select the corresponding model with the action classifier (Figure 4). In parallel, the original video is preprocessed into a grayscale video and normalized to the desired output size $V' = \{F'_i\}_{i=1}^{N}$, $F'_i \in \mathbb{R}^{T \times W \times H}$, and then delivered to the main structure.
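A minimal sketch of the assumed preprocessing step, converting an RGB clip to grayscale and resizing it; the 240×180 target size follows the experimental settings described later, and the luminance weights are a standard assumption rather than a detail stated in the paper.

```python
import torch
import torch.nn.functional as F

def preprocess_clip(frames: torch.Tensor, out_hw=(180, 240)) -> torch.Tensor:
    """Convert an RGB clip of shape (T, 3, H, W) to grayscale and resize each frame."""
    # Luminance conversion with ITU-R BT.601 weights (assumed).
    gray = (0.299 * frames[:, 0] + 0.587 * frames[:, 1] + 0.114 * frames[:, 2]).unsqueeze(1)
    gray = F.interpolate(gray, size=out_hw, mode="bilinear", align_corners=False)
    return gray / 255.0 if gray.max() > 1.0 else gray   # normalize pixel range if needed

clip = torch.randint(0, 256, (103, 3, 480, 640)).float()
print(preprocess_clip(clip).shape)   # torch.Size([103, 1, 180, 240])
```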

Action quality assessment
In the action quality assessment stage, our model uses the newly designed structure shown in Figure 1(b) to extract the motion feature. The model first receives the category information generated by the action classifier to determine which model needs to be loaded. The original video frames are then processed by three stages of multi-bottleneck layers, producing the feature maps $f' = \{y'_i\}_{i=1}^{bt\_sz}$, $y'_i \in \mathbb{R}^{T \times W' \times H'}$. To reduce dimensionality while capturing the temporal information contained in the different feature maps, convolutional layers with a 1×1×1 kernel are used, so the extracted feature maps remain rich in spatial and temporal information. In the last stage of multi-bottleneck layers, the 1×1×1 convolutional layers are replaced by a multi-head attention layer, as shown in Figure 1(d) and Figure 5.
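A minimal sketch of one such bottleneck block is shown below, with 1×1×1 convolutions for channel reduction and expansion around a 3×3×3 spatio-temporal convolution; the channel widths are illustrative and not our exact configuration.

```python
import torch
import torch.nn as nn

class Bottleneck3D(nn.Module):
    """ResNet-style 3D bottleneck: 1x1x1 reduce -> 3x3x3 transform -> 1x1x1 expand."""
    def __init__(self, in_ch: int, mid_ch: int):
        super().__init__()
        self.reduce = nn.Conv3d(in_ch, mid_ch, kernel_size=1)            # 1x1x1 dimensionality reduction
        self.conv = nn.Conv3d(mid_ch, mid_ch, kernel_size=3, padding=1)  # spatio-temporal feature extraction
        self.expand = nn.Conv3d(mid_ch, in_ch, kernel_size=1)            # 1x1x1 expansion back to in_ch
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.reduce(x))
        out = self.relu(self.conv(out))
        out = self.expand(out)
        return self.relu(out + x)                                        # residual connection

x = torch.randn(1, 256, 16, 28, 28)        # (N, C, T, H, W)
print(Bottleneck3D(256, 64)(x).shape)      # torch.Size([1, 256, 16, 28, 28])
```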
Figure 5. The visualization of the multi-head mechanism and the transformation of K, Q, and V.

For the input feature map $f' = \{y'_i\}_{i=1}^{bt\_sz}$, $y'_i \in \mathbb{R}^{T \times W' \times H'}$, the multi-head attention mechanism works as follows. For a given input frame sequence, we first generate three vectors for each position: a Query ($Q$), a Key ($K$), and a Value ($V$). These are obtained through fully connected layers, $Q = f'W_Q$, $K = f'W_K$, $V = f'W_V$, where $W_Q$, $W_K$, and $W_V$ are weights to be learned. Next, for each position, we calculate its similarity with all other positions as the dot product of Query and Key, $\mathit{similarity} = QK^{T}$. This matrix is then scaled by dividing by $\sqrt{d_k}$, where $d_k$ is the dimension of the Key, to avoid excessively large values. Finally, a SoftMax operation yields the attention weights of each position over all other positions:

$\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V.$

Multi-head attention builds on this process: it repeats the computation over $h$ sets of projected queries, keys, and values, producing $h$ sets of results. These $h$ results are concatenated and passed through a linear transformation to obtain the final output:

$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathit{head}_1, \dots, \mathit{head}_h)\,W^{O},$

where each $\mathit{head}_i$ is the result of the above $\mathrm{Attention}(Q, K, V)$ calculation and $W^{O}$ is a weight to be learned. Through our experiments, we found that replacing the last multi-bottleneck stage with a multi-head attention block achieves the best performance.
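The computation above can be summarized in the following sketch; for brevity the per-head projections are folded into a single head-splitting step, and all dimensions are illustrative.

```python
import math
import torch

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # similarity between every pair of positions
    return torch.softmax(scores, dim=-1) @ V

def multi_head(Q, K, V, h, W_out):
    """Split features into h heads, attend per head, concatenate, and project with W_out."""
    def split(x):                                        # (B, L, D) -> (B, h, L, D // h)
        B, L, D = x.shape
        return x.view(B, L, h, D // h).transpose(1, 2)
    heads = attention(split(Q), split(K), split(V))      # h attention computations in parallel
    B, _, L, d = heads.shape
    return heads.transpose(1, 2).reshape(B, L, h * d) @ W_out

B, L, D, h = 2, 16, 512, 8
Q = K = V = torch.randn(B, L, D)
out = multi_head(Q, K, V, h, torch.randn(D, D))
print(out.shape)   # torch.Size([2, 16, 512])
```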

Score prediction
After the multi-head attention blocks, the output feature maps $f'' = \{y''_i\}_{i=1}^{bt\_sz}$, $y''_i \in \mathbb{R}^{W'' \times H''}$, are fed into the MLP block to generate the final quality score according to the action category and dataset. To predict scores, the MLP block takes the input features and passes them through a series of hidden layers; the final layer is a single neuron whose output is the predicted score. The loss function applied in the prediction part is the mean squared error (MSE). Through iterative training, the prediction network minimizes the discrepancy between predicted and actual scores. As a result, by transmitting global awareness to the MLP block, we obtain an accurate score for the action from its output.
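A minimal sketch of such an MLP regression head trained with MSE is given below; the hidden widths and the example score range are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ScoreHead(nn.Module):
    """MLP regression head: hidden layers followed by a single output neuron."""
    def __init__(self, in_dim: int, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden // 2), nn.ReLU(inplace=True),
            nn.Linear(hidden // 2, 1),               # single neuron -> predicted quality score
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.mlp(feats).squeeze(-1)

head = ScoreHead(in_dim=1024)
criterion = nn.MSELoss()                              # mean squared error between predicted and true scores
pred = head(torch.randn(8, 1024))
loss = criterion(pred, torch.rand(8) * 100.0)         # e.g. quality scores on a 0-100 scale (illustrative)
loss.backward()
```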

Dataset
The dataset we choose to evaluate our model is AQA-7 [13], a comprehensive dataset for the AQA task. It contains clips of seven categories from major sporting events such as the Olympics and world tournaments, comprising 1,189 video clips in total: single diving (10 m platform), gymnastic vault, big air skiing, big air snowboarding, synchronous 3 m springboard diving, synchronous 10 m platform diving, and trampoline. For most of the dataset there is little angle variation or camera movement within a single clip, except for the big air skiing and snowboarding categories, which suits our task well. All clips are labeled with scores, and these scores can differ significantly depending on the execution.

Evaluation Metric
Spearman's rank correlation is adopted to measure the strength and direction of the relationship between the ground truth and the predicted values. The coefficient ranges from -1 to 1. A value of +1 indicates a perfect monotonic positive relationship, where higher ranks in one variable correspond to higher ranks in the other. A value of -1 indicates a perfect monotonic negative relationship, where higher ranks in one variable correspond to lower ranks in the other. A value of 0 indicates no monotonic relationship between the variables. Spearman's rank correlation does not assume a specific distribution of the data, which suits the case of neural networks well. It is computed as

$\rho = \frac{\sum_i (p_i - \bar{p})(q_i - \bar{q})}{\sqrt{\sum_i (p_i - \bar{p})^2}\,\sqrt{\sum_i (q_i - \bar{q})^2}},$

where $p$ denotes the rank series of the ground-truth scores and $q$ denotes the rank series of the predicted scores. The average performance across categories is calculated using Fisher's z-value [12].
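In practice the coefficient can be computed with SciPy and the per-category coefficients averaged via the Fisher z-transform; the score values in this sketch are made-up examples.

```python
import numpy as np
from scipy import stats

def average_rho(per_action_rhos):
    """Fisher z-averaged Spearman correlation across action categories."""
    z = np.arctanh(np.asarray(per_action_rhos))   # Fisher z-transform of each coefficient
    return np.tanh(z.mean())                      # back-transform the mean

gt = np.array([85.2, 67.0, 92.5, 74.3])           # ground-truth scores (illustrative)
pred = np.array([80.1, 70.4, 90.0, 73.8])         # predicted scores (illustrative)
rho, _ = stats.spearmanr(gt, pred)                # rank correlation for one action category
print(rho, average_rho([rho, 0.80, 0.76]))
```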

Details
Because our model presents a universal pipeline for evaluating actions across different categories, we discarded the trampoline category due to its long clip duration. The remaining clips are all shorter than 180 frames; 803 of them were used for training and 303 for testing. The batch size was 1 and the number of training epochs was 300. All videos are normalized to 103 frames, as provided by the AQA-7 dataset, and frames are resized to 240×180 for our model. Training is performed on two NVIDIA GP104GL GPUs. The Adam optimizer is adopted with a learning rate of 0.0001, and our implementation is based on PyTorch [14].
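A minimal stand-in for this training setup is sketched below; the model and data are dummy placeholders, and only the optimizer choice, learning rate, loss, clip shape, and loop structure mirror the settings above.

```python
import torch
import torch.nn as nn

# Dummy model and data; in the real experiments the model is our assessment network.
model = nn.Sequential(nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(1, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # Adam with learning rate 0.0001
criterion = nn.MSELoss()

# One clip per batch (batch size 1): 103 grayscale frames resized to 240x180.
dummy_loader = [(torch.randn(1, 1, 103, 180, 240), torch.rand(1)) for _ in range(4)]

for epoch in range(2):                    # 300 epochs in the real experiments
    for clip, score in dummy_loader:
        optimizer.zero_grad()
        loss = criterion(model(clip).squeeze(-1), score)
        loss.backward()
        optimizer.step()
```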
In terms of model performance, we achieve satisfactory results on the AQA dataset, as shown in the first three rows of Figure 6. However, less satisfactory results are observed in clips where the human figure appears very small or the action involves complex body coordination. For example, in the last row of Figure 6, the snowboarder partially blends into the background, making it challenging for our model to discern the movement accurately. The limited presence of the human figure in the frame, along with a distracting flash of light, also contributes to the outcome falling short of expectations in this clip. In general, these issues can lead to some less accurate results from our model.

Comparison with the state of the art on the AQA-7 dataset
In comparison with the state of the art, our model outperforms current models in several categories. Our experiments show that systematically replacing convolutional layers with multi-head attention layers achieves satisfactory results, as reported in Table 1. Our improved network captures richer temporal and contextual features than traditional convolutional networks. Moreover, by adapting a traditional convolutional network, we avoid the heavy computational cost and long processing time of pure attention-based methods. Our newly designed network also shows exceptional flexibility in handling a wide range of action categories, removing the need to modify the model structure for different actions. The effectiveness of our approach is evident from the strong generalization results achieved across diverse actions, as demonstrated in Table 1. This can be attributed to the multi-head attention mechanism, which allows the model to adapt to the intricate details of various actions, and to the convolutional dimensionality reduction, which preserves crucial temporal information. Combining these components, our network accommodates diverse action scenarios while maintaining high performance and efficiency.

Effectiveness of attention mechanism. To verify the benefit of the attention mechanism, we compare the performance of the non-attention network and the attention-equipped network on all tasks. As shown in Table 2, the average Spearman correlation coefficient of the multi-head model is 0.8472, significantly higher than the average performance of the non-attention network. In our ablation study, the multi-head attention network outperforms the non-attention network on every evaluated task. This strongly indicates the importance of the attention mechanism and the multi-head structure, which give the model the ability to capture complex patterns and thus achieve better results across tasks.

Importance of action classifier accuracy. From Table 3, it is evident that the high-accuracy classifier yields a higher average Spearman correlation coefficient (0.8472) across all actions than the low-accuracy classifier (0.3597). This suggests that an accurate action classifier is crucial to the performance of the overall model: its ability to correctly categorize the input action allows the corresponding assessment branch to be selected, so the model can function effectively and efficiently.

Conclusion
Recently, most approaches to action quality assessment (AQA) have relied on action recognition or action classification models built on conventional convolutional neural networks (CNNs). However, these models often overlook critical aspects of AQA, such as the smoothness of movement and the degree of completion. To address this limitation and consider essential factors such as coordination and minute execution details holistically, we present a novel pipeline for AQA. First, we developed a classifier capable of accurately discerning the precise action being performed. To achieve a more comprehensive evaluation, our model integrates I3D (Inflated 3D ConvNet) and employs a multi-head attention mechanism; this combination allows our model to assign a score to the action while considering a broader range of contextual information. To validate the effectiveness of our approach, we conducted extensive experiments on the widely used AQA-7 dataset and compared our results against state-of-the-art methods. Our model demonstrated superior accuracy and overall performance, outperforming existing approaches in multiple sports categories while maintaining a low computational cost. Based on these promising results, we believe that our model exhibits great potential for real-time action assessment tasks at scale.

Figure 1. Overview of our project for action quality assessment.

Figure 4. The detailed structure of the action classifier.

Figure 6. Visualization of our results on the AQA dataset.

Table 1. Comparison with the state of the art on the AQA-7 dataset.

Table 2. Ablation study of the attention mechanism on the AQA-7 dataset.

Table 3. Analysis of the accuracy of the action classifier.