Cricket Video Events Recognition using HOG, LBP and Multi-class SVM

The world has witnessed rapid growth in multimedia data, especially video data, over the past few years owing to increased internet bandwidth and higher processing power of computers. Such large volumes of video data require techniques for storage, summarization, indexing and retrieval. In recent years, more attention has been given to techniques that summarize, index and retrieve sports videos because of their commercial value. This paper proposes a framework which classifies a cricket video into one of four events, namely Bowled Out, Caught Behind, Catch Out and LBW Out. The framework uses training videos from each event category and summarizes each video into key frames. HOG and LBP features are computed for the key frames and fused to form a single feature vector, which represents a single event in the video and is labeled accordingly. The feature vector is given to a multi-class SVM, which classifies the video into one of the four events. The experimental results show that the Precision of our technique is 77.23%, Recall is 77.86%, F-Measure is 77.55% and Accuracy is 65.62%. These evaluation metrics are promising, because no other technique for event detection and classification in cricket videos is present in the literature so far.


Introduction
The world of multimedia has witnessed exponential growth in video data. Video data has grown in recent years mainly due to improvements in multimedia technology, greater processing power, and faster, more robust networks. These improvements have led to the generation of a vast amount of video data in areas such as movies, sports, surveillance and news [1].
Owing to the large revenue potential and TV viewership of cricket, automatic highlights generation is considered to be of utmost importance [3]. Most viewers are interested in watching a compact version of a completed game rather than the full match. The main contributions of the proposed framework are as follows:
• Automatic event detection and classification from cricket video is presented. The proposed framework is one of a kind, as no other technique in the existing literature detects and classifies important events from a cricket video. Other techniques operate on cricket videos, but they do not explicitly detect and classify the important events in those videos.
• No standard dataset exists for training and testing on the events present in a cricket video. The dataset for this purpose was developed manually by separating, from large cricket videos, the clips which belonged to an event. In this way, 160 clips were created manually for each of the four events, namely Bowled Out, LBW Out, Caught Behind and Catch Out.
The rest of the paper is organized as follows. Section 2 discusses the related work, Section 3 presents the proposed framework, Section 4 discusses the experimental results and Section 5 concludes the paper.

Related Work
In this section we discuss previous research in the field of sports video processing and analysis. In the past few years there has been extensive research on the semantic analysis of sports videos. This research has focused on automatic sports video annotation, indexing, retrieval and highlights creation, making use of semantically important sports video content along with multimodal data [4]. Semantic analysis of sports video is considered a challenging task because of the huge amount of sports video, the variety of sports video broadcasters and the semantic gap between high-level and low-level features [2]. The existing research in sports video analysis can be broadly divided into two categories: genre-specific and genre-independent. Most of the research on semantic analysis of sports video is genre-specific, mainly because every sport is played differently and has its own rules and actions. Genre-specific research focuses on a specific sport such as soccer [2], baseball [6], volleyball, tennis, cricket [3,8], golf or basketball. For detecting an event in a particular sports video, it is not wise to claim that a genre-independent solution would be feasible, because no two sports are the same and every sport has its own set of rules and structure. American football videos have been summarized in [9] by making use of the textual overlays present in the videos.
In this paper we propose a framework which detects and classifies significant events from a cricket video, namely Bowled Out, Catch Out, LBW Out and Caught Behind. Our framework is unique in the sense that no other technique in the literature detects and classifies events from a cricket video. There was no standard dataset of cricket videos, so a dataset of events was created manually from a large set of cricket videos by separating the video clips containing an event. Each video in the dataset belongs to a particular event, and these videos are used for training and testing purposes.

Proposed Method
In this section, all the processing steps involved in our proposed framework are explained in detail. Our video dataset $Ð$ is divided into a training dataset $Ð_{Training}$ and a testing dataset $Ð_{Testing}$, where $Ð$ consists of video clips from four different types of events, i.e., Bowled Out, Catch Out, LBW Out and Caught Behind Out. The training dataset is represented as the set of videos $Ѵ_i^T$, $i = 1, \ldots, N$, where N is the total number of videos in the dataset. During the training process, each video in the training dataset is summarized into five key frames, which are extracted for further processing by employing the summarization technique proposed in [10]. For instance, a video $Ѵ_i^T$ from our dataset is taken and a set of frames $Ƒ_i^{v} = \{Ƒ_i^{v1}, Ƒ_i^{v2}, \ldots, Ƒ_i^{vn}\}$ is extracted from it, where n = 5, $Ƒ_i^{v}$ represents the frames of one particular video from the database and $Ƒ_i^{v1}$ represents the first frame of that video. The experimental results show that five key frames are sufficient to represent the entire video clip for further processing. Every extracted key frame is resized to 125 by 250 and converted into grayscale, and each image is enhanced by applying a median filter and histogram equalization to suppress noise and improve contrast.
In the next step, the HOG descriptor is computed for every key frame and the results are combined to form a HOG feature vector for a single video clip. The LBP descriptor is then used to extract features from all five key frames of the video, and these are likewise combined to form a feature vector. Finally, the HOG and LBP feature vectors of all five key frames of a single video clip are fused to form a single feature vector. In terms of $Ƒ_i^{v}$, the set of frames extracted from a video: applying HOG to this set of images and combining the results gives the feature vector $Ƒ_{i,HOG}^{v}$, and applying LBP to the frames gives the combined feature vector $Ƒ_{i,LBP}^{v}$. Once we have $Ƒ_{i,HOG}^{v}$ and $Ƒ_{i,LBP}^{v}$, we fuse the two together to get a single feature vector $Ϝ_{(HOG \oplus LBP)}$.
A feature vector is computed in this way from the extracted frames of every video and is assigned a label according to the event it belongs to, $Ĺ_i \in \{Ĺ_1, \ldots, Ĺ_4\}$, where i ranges over the four events. For training purposes these labeled features are given to a multi-class SVM ç. In the testing phase a video from the testing dataset $Ð_{Testing}$ is given as input; key frames are extracted from the video and preprocessed, and their features are extracted, represented as $Ϝ_{(HOG \oplus LBP)}$ and tested against their labels $Ĺ_i$. The proposed framework is shown in Figure 1. Details of all the symbols of the framework are given in Table 1.
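The fusion step can be illustrated with a minimal Python sketch. It assumes that "combining" and "fusing" both mean plain concatenation, which the text suggests but does not state outright; the function name `fuse_features` and the per-frame vector sizes (4 HOG values and 3 LBP values) are hypothetical and chosen only to keep the example small.

```python
from typing import List, Sequence


def fuse_features(hog_per_frame: Sequence[Sequence[float]],
                  lbp_per_frame: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate per-frame HOG vectors, then per-frame LBP vectors (assumed fusion rule)."""
    if len(hog_per_frame) != len(lbp_per_frame):
        raise ValueError("HOG and LBP features must cover the same key frames")
    hog_vector = [v for frame in hog_per_frame for v in frame]  # combined HOG features
    lbp_vector = [v for frame in lbp_per_frame for v in frame]  # combined LBP features
    return hog_vector + lbp_vector                              # fused vector, HOG followed by LBP


# Toy input: 5 key frames with made-up feature values.
hog = [[0.1 * k] * 4 for k in range(5)]
lbp = [[float(k)] * 3 for k in range(5)]
fused = fuse_features(hog, lbp)
print(len(fused))  # 5 * 4 + 5 * 3 = 35
```

Because the fused length is fixed by the number of key frames and the descriptor sizes, every clip summarized into five frames yields a vector of the same length.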

Frame Selection and Pre-processing
In the first step each video $Ѵ_i^T$ from $Ð_{Training}$ is given to the system as input and summarized into a fixed number of key frames (five, as described above). If the number of extracted key frames differed from video to video, the size of the fused feature vector would also differ between clips, which would pose a problem during the testing phase. Once the key frames of a video are extracted, preprocessing operations are applied to all of them. To remove noise from the images, a median filter of size 3 × 3 is applied to all images of a single video. The filter replaces each pixel's value with the median of its neighborhood, which helps in removing salt-and-pepper noise. Histogram equalization is then applied to the images so that the image contrast is enhanced, providing better features for accurate classification.
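The two preprocessing operations can be sketched in plain Python on a grayscale image stored as a list of rows. This is only an illustration of the operations named above, not the toolbox routines used in the experiments. The function names are hypothetical, and two choices are assumptions: border pixels use whichever neighbors exist, and `median_low` is used so that the output is always an existing pixel value.

```python
from statistics import median_low
from typing import List

Image = List[List[int]]  # grayscale image, pixel values in 0..255


def median_filter_3x3(img: Image) -> Image:
    """Replace each pixel with the median of its 3x3 neighborhood."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            window = [img[rr][cc]
                      for rr in range(max(0, r - 1), min(h, r + 2))
                      for cc in range(max(0, c - 1), min(w, c + 2))]
            out[r][c] = median_low(window)
    return out


def equalize_histogram(img: Image, levels: int = 256) -> Image:
    """Spread the gray levels over the full range using the cumulative histogram."""
    hist = [0] * levels
    for row in img:
        for v in row:
            hist[v] += 1
    cdf, total = [], 0
    for count in hist:
        total += count
        cdf.append(total)
    cdf_min = next(c for c in cdf if c > 0)  # CDF value at the darkest level present
    if total == cdf_min:                     # constant image: nothing to stretch
        return [row[:] for row in img]
    return [[round((cdf[v] - cdf_min) * (levels - 1) / (total - cdf_min)) for v in row]
            for row in img]


# A single "salt" pixel is removed by the median filter.
noisy = [[10, 10, 10],
         [10, 255, 10],
         [10, 10, 10]]
denoised = median_filter_3x3(noisy)

# A low-contrast image is stretched to the full 0..255 range.
ramp = [[50, 100], [150, 200]]
stretched = equalize_histogram(ramp)
print(denoised, stretched)
```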

Feature Extraction
In the proposed methodology, two feature descriptors, Histogram of Oriented Gradients (HOG) and Local Binary Patterns (LBP), are computed for all extracted key frames belonging to each class of videos. The HOG features of the selected frames are combined, the LBP features of the frames are combined similarly, and the two are fused to form a consolidated one-dimensional vector, which represents a complete event and is labeled accordingly.
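To make the two descriptors concrete, the following Python sketch computes a basic 8-neighbor LBP histogram and a single-cell orientation histogram of the kind HOG is built from. Both are deliberately simplified and are assumptions rather than the configuration used in the experiments: a full HOG descriptor also divides the image into many cells and normalizes over blocks, and LBP has several variants. The function names and the 9-bin setting are illustrative.

```python
import math
from typing import List

Image = List[List[int]]  # grayscale image as a list of rows

# Neighbor offsets, clockwise starting at the top-left pixel.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(img: Image, r: int, c: int) -> int:
    """8-bit LBP code: bit k is set when neighbor k is >= the center pixel."""
    center = img[r][c]
    code = 0
    for bit, (dr, dc) in enumerate(OFFSETS):
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(img: Image) -> List[int]:
    """256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            hist[lbp_code(img, r, c)] += 1
    return hist


def hog_cell_histogram(img: Image, bins: int = 9) -> List[float]:
    """Magnitude-weighted histogram of unsigned gradient orientations (0..180 degrees)."""
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            gx = img[r][c + 1] - img[r][c - 1]  # horizontal central difference
            gy = img[r + 1][c] - img[r - 1][c]  # vertical central difference
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[min(int(angle // bin_width), bins - 1)] += math.hypot(gx, gy)
    return hist


patch = [[9, 1, 9],
         [1, 5, 1],
         [9, 1, 9]]
print(lbp_code(patch, 1, 1))  # corners are brighter than the center: bits 0, 2, 4, 6

ramp = [[0, 10, 20],
        [0, 10, 20],
        [0, 10, 20]]
print(hog_cell_histogram(ramp))  # purely horizontal gradient falls in the first bin
```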

Event Classification
Our proposed framework uses a Support Vector Machine (SVM) to classify the cricket events in videos. SVM is a supervised machine learning algorithm, which means that labeled data is required for training. In its basic form, an SVM is a linear model which learns a separating hyperplane (a line in two dimensions) that divides the data into two classes. A conventional SVM therefore performs binary classification only. In our proposed framework the cricket events must be classified into four different classes, so we make use of a multi-class SVM. Although SVM was designed for binary classification, it can be extended to multi-class classification by techniques such as one-versus-rest (also called one-versus-all) or one-versus-one, which convert a single multi-class problem into several binary classification problems. In our proposed framework we have utilized the one-versus-all technique. Since the number of features is large, we have used an SVM with a linear kernel δ (see Eq. 1). Our labeled training samples are represented as ($Ϝ_{(HOG \oplus LBP)}$, Ĺj), Ĺj ∈ {1,2,3,4}, and the classification is given by Eq. 2, which involves the Lagrange multipliers of the dual optimization problem, the kernel function δ and the bias ß of the hyperplane.
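For orientation, a linear kernel is the inner product δ(a, b) = a · b, and a binary SVM in dual form scores a sample x as f(x) = Σ α_i y_i δ(x_i, x) + ß. The symbols α_i (Lagrange multipliers), y_i (±1 labels) and x_i (support vectors) follow common textbook notation and are assumptions here; they may differ from the paper's own Eq. 1 and Eq. 2. The Python sketch below applies that form in a one-versus-all arrangement. The model parameters are set by hand for illustration and are not the result of any training.

```python
from typing import Dict, Sequence, Tuple

Vector = Sequence[float]
# One binary model per event: (support vectors, multipliers alpha_i, labels y_i in {+1, -1}, bias)
BinaryModel = Tuple[Sequence[Vector], Sequence[float], Sequence[int], float]


def linear_kernel(a: Vector, b: Vector) -> float:
    """Linear kernel: the inner product of the two vectors."""
    return sum(x * y for x, y in zip(a, b))


def decision_value(x: Vector, model: BinaryModel) -> float:
    """Dual-form SVM score: sum_i alpha_i * y_i * kernel(x_i, x) + bias."""
    support_vectors, alphas, signs, bias = model
    return sum(a * s * linear_kernel(sv, x)
               for a, s, sv in zip(alphas, signs, support_vectors)) + bias


def classify_one_vs_all(x: Vector, models: Dict[int, BinaryModel]) -> int:
    """Return the label whose binary classifier gives the highest score."""
    scores = {label: decision_value(x, model) for label, model in models.items()}
    return max(scores, key=scores.get)


# Hand-set toy models for labels 1..4 in a 2-D feature space (hypothetical values).
models: Dict[int, BinaryModel] = {
    1: ([(1.0, 0.0)], [1.0], [1], 0.0),
    2: ([(0.0, 1.0)], [1.0], [1], 0.0),
    3: ([(-1.0, 0.0)], [1.0], [1], 0.0),
    4: ([(0.0, -1.0)], [1.0], [1], 0.0),
}
predicted = classify_one_vs_all((0.2, 0.9), models)
print(predicted)
```

In one-versus-all, each of the four classifiers is trained to separate one event from the other three, and the event with the largest score is taken as the prediction.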

Experimental Results
The main goal of this experiment is to evaluate the effectiveness of the proposed system in detecting different cricket dismissals. Each of the four cricket dismissals discussed in the previous section is tested on the test dataset or the cross-validation dataset. Each dataset comprises video clips, where each video clip contains one event. The experimental results demonstrate the performance and robustness of the proposed framework and its flexibility in detecting different events. The proposed system is implemented in Matlab (2017a), using its image processing and computer vision toolboxes. All experiments were performed on an Intel Core i7 CPU running at 1.8 GHz with 6 GB of RAM.

Dataset
Since no standard dataset was available for event detection, a dataset was developed manually from cricket videos on the internet. In our experiments, the dataset contained 640 video clips in total, 160 for each of the four events. A total of 480 videos were used for training on the four events. The summary of event learning using our dataset is provided in Table 2. We evaluated our proposed system using three common metrics: accuracy, detection rate (recall) and positive predictive value (PPV, i.e. precision). These metrics are computed from four counts: TP (True Positive) indicates that the system correctly detects an event; TN (True Negative) indicates that the system performs a correct rejection; FP (False Positive) indicates a false alarm from the system; and FN (False Negative) indicates a missed detection. We manually annotated the ground truth, in which events and non-events are identified for each dataset. The average accuracy of the four event detectors is 65.625%, the average detection rate is 77.8% and the average PPV is 77.23%.
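The three metrics follow the usual definitions in terms of these counts. A minimal Python sketch is given below; the counts in the example are made-up illustrative numbers, not results from the experiments.

```python
def detection_rate(tp: int, fn: int) -> float:
    """Detection rate (recall): fraction of actual events that were detected."""
    return tp / (tp + fn)


def positive_predictive_value(tp: int, fp: int) -> float:
    """PPV (precision): fraction of detections that were actual events."""
    return tp / (tp + fp)


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all decisions (detections and rejections) that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical counts for one event detector.
tp, tn, fp, fn = 30, 110, 10, 10
print(detection_rate(tp, fn))             # 30 / 40
print(positive_predictive_value(tp, fp))  # 30 / 40
print(accuracy(tp, tn, fp, fn))           # 140 / 160
```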

Classified Visual Results
The video is given as input to the key frame extraction algorithm explained in the proposed methodology section. Our algorithm summarizes the video clip into five key frames, as shown in Fig. 2. The clips are grouped into four event categories, namely 'Bowled Out', 'Caught Behind', 'Catch Out' and 'LBW Out'. The Bowled category contains all Bowled Out event videos, and likewise for the LBW, Catch and Caught Behind categories. The five key frames are then given as input to the proposed classifier, which classifies the video into one of the Bowled, LBW, Catch or Caught Behind categories.
The F-measure of the proposed system is given in Table 3. In this table, the first row gives the values for the Bowled event category and the next three rows give the values for the LBW, Catch and Caught Behind event categories; the Caught Behind category has the highest TP value. A total is then calculated by adding the values of all four events, and finally Precision, Recall, F1-score and Accuracy are calculated. Table 3 thus reports the precision and recall obtained with the descriptors used, and the proposed method achieves high precision and recall values. The graph in Fig. 3 compares the results for the different video events with the ground truth, as reported in Table 3. The black line shows the values of the Bowled event category, and the red, blue and green lines show the values of the LBW, Catch and Caught Behind event categories.
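The F-measure is the harmonic mean of precision and recall. As a quick check, the Python sketch below recomputes it from the reported Precision (77.23%) and Recall (77.86%); the function name is illustrative.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale)."""
    return 2 * precision * recall / (precision + recall)


f = f_measure(77.23, 77.86)
print(round(f, 2))  # about 77.54
```

The recomputed value agrees with the reported 77.55% to within about 0.01 percentage points, a difference that rounding of the inputs can explain.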

Conclusion
In this paper a novel framework for detecting events in cricket video is presented. An input video clip is summarized into key frames. HOG and LBP features are computed for each key frame and fused to form a single feature vector, and the video clip is labeled accordingly. A multi-class SVM is used for training and testing. No method in the existing literature detects important events in a cricket video. The metrics used to evaluate our technique against the ground truth were Precision, Recall, F-measure and Accuracy, which show good results. Our proposed technique uses the HOG and LBP feature descriptors, and the resulting feature vector is very large. In future work, we will explore descriptors which are not only robust but also more efficient than those currently used.