Sign Language Word Detection Using LRCN

Sign language is the most effective means of communication for deaf or hard-of-hearing people. Understanding sign language requires specialized training, so hearing people around them often cannot communicate with them effectively. The main objective of this study is to develop a streamlined deep learning model for sign language recognition, built around 30 of the most prevalent words in everyday life. The dataset covers 30 ASL (American Sign Language) words and consists of custom-processed video sequences recorded by 5 subjects, with 50 sample videos for each class. A CNN is applied to the video frames to extract spatial properties, and an LSTM then uses the CNN's features to predict the action performed in the video. We present and evaluate results on two separate datasets, the pose dataset and the raw video dataset, both trained with the Long-term Recurrent Convolutional Network (LRCN) approach. Finally, a test accuracy of 92.66% was reached for the raw dataset and 93.66% for the pose dataset.


Introduction
Effective communication through sign language relies on physical gestures and facial expressions. Sign languages use both manual and non-manual markers to convey meaning. Like spoken languages, every sign language has its own syntax and lexicon [1]. American Sign Language (ASL) is the official sign language of the United States and Canada [2]. It is also the foundation for several other international sign languages and has the most extensive collection of video signs available online. ASL is the native language of between 250,000 and 500,000 people in the United States [3]. It is also prevalent in Canada, West Africa, and Southeast Asia.

Many studies and innovations have been made to aid people with communication disorders [4][5][6]. In the field of image processing, deep learning and machine learning techniques have seen a rise in popularity and pervasive adoption [7,8]. Deep learning and computer vision can also be employed to advance this goal while ensuring ease of use [9,10]. A Recurrent Neural Network (RNN) is a form of neural network that incorporates loops for internal data storage [11]. In a nutshell, RNNs can predict future outcomes by applying knowledge gained from past encounters. The Long Short-Term Memory (LSTM) network is an enhanced version of the recurrent neural network that can better retain memories [12]. LSTM networks offer a solution to the vanishing gradient problem associated with RNNs and are particularly effective in time series tasks with unpredictable time lags [13]. Their proficiency in learning long-term dependencies has been successfully applied to tasks such as time series classification, processing, and prediction, including the categorization of sign language word data.

In this paper, the main contributions are as follows: (i) introduction of a proprietary dataset comprising 30 common American Sign Language gestures, sourced from five users; (ii) implementation of MediaPipe for the extraction of pose coordinates from the dataset; (iii) development of a Long-Term Recurrent Convolutional Network model, utilizing CNNs for spatial feature extraction and an LSTM for temporal aspects, evaluated on two types of datasets (raw and pose); and (iv) attainment of a 92.66% test accuracy on the raw dataset and an enhanced 93.66% accuracy on the pose dataset.

Several related systems have been proposed. The authors of [17] suggested an MLP classifier-based ASL recognition system covering 32 letters and digits, excluding "J," "Z," "2," and "6"; the system achieved a 90% accuracy rate. Moreover, to more fully recognize American Sign Language (ASL), Chong et al. [18] developed a sign language recognition system using the Leap Motion Controller. Their system attained recognition rates of 80.30% (SVM) and 93.81% (DNN) for 26 ASL letters; when 10 digits were added, the rates were slightly lower, at approximately 72.79% (SVM) and 88.79% (DNN). Cui et al. [19] developed a video sequence-based study of weakly-supervised deep neural networks using a three-stage optimization technique that included an end-to-end sequence with CTC, feature extraction with alignment proposals, and model optimization with enhanced feature extraction. Besides, Jeff Donahue et al. [19] built an end-to-end trainable recurrent CNN architecture for large-scale visual learning that permits the discriminative extraction of hand gesture features. In addition, Y. Liao et al. [20] used a bidirectional LSTM coupled with 3D residual networks (BLSTM-3D ResNet) on two different datasets, achieving 89.8% on the DEVISIGN-D dataset and 86.9% on the SLR_Dataset.

Methodology
The section commences with a concise overview of the dataset, encompassing discussions on preprocessing methodologies and the utilization of MediaPipe. Subsequently, it delineates the systematic architecture of the integrated Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM)-based Long-Term Recurrent Convolutional Network (LRCN) model.

Dataset Description
In this study, we introduce a novel dataset designed to evaluate the performance of the Long-Term Recurrent Convolutional Network (LRCN) model. The dataset was meticulously curated through the collaborative efforts of 5 users, featuring 30 American Sign Language (ASL) terms. Comprising custom-processed video sequences with 50 sample videos for each class, the dataset focuses on the 30 words most commonly used by deaf people and people without disabilities in their daily communication. The primary objective is to facilitate a universal understanding of these terms, thereby bridging the communication gap between hearing individuals and those who are deaf. The data collection involved the collaboration of 5 users under varying light conditions, adding a diverse range of environmental factors that enrich the dataset. Table 1 illustrates the lexicon of words or signs incorporated in our dataset.
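For concreteness, the sketch below shows one way such a dataset could be organized and split. The directory layout, folder name, and the 40/10 per-class split are assumptions chosen to reproduce the 1,200/300 partition reported later; only the per-class and total counts come from the paper.

```python
import os
import random

DATASET_DIR = "asl_words"  # hypothetical root: one sub-folder per word (30 classes)
random.seed(42)

train_split, test_split = [], []
for word in sorted(os.listdir(DATASET_DIR)):
    clips = sorted(os.listdir(os.path.join(DATASET_DIR, word)))  # 50 clips per class
    random.shuffle(clips)
    # 40 training / 10 test clips per class -> 1,200 / 300 videos overall
    train_split += [(word, c) for c in clips[:40]]
    test_split += [(word, c) for c in clips[40:]]

print(len(train_split), len(test_split))  # expected: 1200 300
```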

MediaPipe
MediaPipe [21] is a framework for constructing machine learning pipelines that process time-series data such as audio and video. Human pose estimation from video is crucial in numerous applications, such as measuring physical activity, sign language recognition, and full-body gesture control; in our case, it plays an important role in detecting pose landmarks. The MediaPipe framework's primary purpose is to facilitate the rapid creation of perceptual pipelines, complete with artificial intelligence models for inference and other reusable components. It also eases the integration of computer vision applications into demonstrations and production software running on a variety of hardware platforms. We used MediaPipe to extract the pose, face, and hand connections for our pose dataset (see Figure 1).

Dataset Pre-processing

Our dataset has two parts, both following ASL signs: a raw video dataset and a pose video dataset. 1) Raw Video Dataset: a regular RGB video recording of each word, which is fed directly to the model. 2) Pose Video Dataset: developed using MediaPipe; we explored removing the background and keeping only the MediaPipe landmarks for better performance, and the resulting frames were fed to the model.

After collecting the videos for each word, each video was processed in a single pass and its number of frames was checked. The majority of videos have fewer than 30 frames, so we fixed 30 frames per video cycle. To pre-process the videos, we up-sample short clips and down-sample long ones, skipping frames when a video has more than 30. The dataset's video files were then read, the frames were resized to a constant width and height to reduce computation, and the pixel values were divided by 255 to normalize the data to the range [0, 1], which promotes convergence during network training. In total, the dataset provides 1,500 videos, 50 per class: 1,200 samples for the training set and 300 for the test set. Figure 2 visually depicts both the raw video and the pose video of the "goodbye" sign, employing MediaPipe for pose extraction.
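The following is a minimal Python sketch of this preprocessing pipeline. The 30-frame length, 64 x 64 resolution, and division by 255 come from the paper; the even-spacing sampling strategy, the use of OpenCV, and the MediaPipe Holistic-based landmark rendering are assumptions about how it might be implemented.

```python
import cv2
import mediapipe as mp
import numpy as np

SEQUENCE_LENGTH = 30            # frames kept per video (from the paper)
IMG_HEIGHT, IMG_WIDTH = 64, 64  # network input resolution (from the paper)

def extract_frames(video_path):
    """Sample SEQUENCE_LENGTH evenly spaced frames, resize, and scale to [0, 1]."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(total / SEQUENCE_LENGTH, 1)  # step > 1 skips frames (down-sampling)
    frames = []
    for i in range(SEQUENCE_LENGTH):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * step))
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, (IMG_WIDTH, IMG_HEIGHT))
        frames.append(frame.astype(np.float32) / 255.0)
    cap.release()
    while frames and len(frames) < SEQUENCE_LENGTH:
        frames.append(frames[-1])  # pad short clips (up-sampling)
    return np.asarray(frames)     # shape: (30, 64, 64, 3)

mp_holistic = mp.solutions.holistic
mp_draw = mp.solutions.drawing_utils

def to_pose_frame(frame_bgr, holistic):
    """Render pose, face, and hand landmarks on a black background (pose dataset)."""
    results = holistic.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    canvas = np.zeros_like(frame_bgr)  # background removed
    if results.pose_landmarks:
        mp_draw.draw_landmarks(canvas, results.pose_landmarks,
                               mp_holistic.POSE_CONNECTIONS)
    if results.face_landmarks:
        mp_draw.draw_landmarks(canvas, results.face_landmarks,
                               mp_holistic.FACEMESH_CONTOURS)
    for hand in (results.left_hand_landmarks, results.right_hand_landmarks):
        if hand:
            mp_draw.draw_landmarks(canvas, hand, mp_holistic.HAND_CONNECTIONS)
    return canvas

# Usage: with mp_holistic.Holistic() as holistic:
#            pose_frame = to_pose_frame(raw_bgr_frame, holistic)
```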

LRCN Model
The Long-Term Recurrent Convolutional Network (LRCN) [22] is a model that combines convolutional neural network (CNN) layers with long short-term memory (LSTM) layers. Convolutional layers extract the spatial information of the images, and the collected spatial features are then passed to the LSTM layer(s) at each time step for temporal sequence modelling. Convolutional neural networks (CNNs) [23] excel at finding patterns that are local to a specific area of the data, which makes them very effective at classifying images.
Recurrent neural networks (RNNs) [19] are particularly well suited to processing time-series and other sequential data. RNNs that must retain information over long spans are hard to train, whereas LSTMs perform better on such data because their gated units can hold information for longer. Since we are working with videos, the LRCN model is an appropriate fit. To implement our LRCN, each time-distributed Conv2D layer is followed by MaxPooling2D and Dropout layers; the time-distributed wrapper applies the same layer to every frame, which makes it useful for processing time series data or individual video frames. Figure 3 delineates the comprehensive framework of the proposed LRCN model utilized in our dataset. The features collected from the Conv2D layers are flattened in a Flatten layer before being passed to an LSTM layer, and a Dense layer with SoftMax activation produces the prediction from the LSTM layer's output. The SoftMax function is defined as $\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$, where $z$ is the vector of outputs from the preceding layer and $K$ is the number of classes (here, 30). The model framework shows 73,588 parameters in total, all of which are trainable (0 non-trainable). Table 2 provides the layer types, output shapes, and input shape. The input size is 64 x 64 x 3, and the LSTM layer of our network has 32 units.
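A minimal Keras sketch of such an LRCN is shown below. The 30 x 64 x 64 x 3 input, the 32-unit LSTM, and the SoftMax-activated Dense output follow the paper; the number of convolutional blocks, filter counts, pooling sizes, and dropout rates are assumptions, so the parameter count will not necessarily match the reported 73,588.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

SEQ_LEN, HEIGHT, WIDTH, CHANNELS = 30, 64, 64, 3
NUM_CLASSES = 30

model = models.Sequential([
    tf.keras.Input(shape=(SEQ_LEN, HEIGHT, WIDTH, CHANNELS)),
    # Spatial feature extraction, applied to every frame independently.
    layers.TimeDistributed(layers.Conv2D(16, (3, 3), padding="same", activation="relu")),
    layers.TimeDistributed(layers.MaxPooling2D((4, 4))),
    layers.TimeDistributed(layers.Dropout(0.25)),
    layers.TimeDistributed(layers.Conv2D(32, (3, 3), padding="same", activation="relu")),
    layers.TimeDistributed(layers.MaxPooling2D((4, 4))),
    layers.TimeDistributed(layers.Dropout(0.25)),
    # Flatten each frame's feature map into a vector: (batch, 30, features).
    layers.TimeDistributed(layers.Flatten()),
    # Temporal modelling over the 30-frame sequence.
    layers.LSTM(32),
    # SoftMax over the 30 word classes.
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```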

Results and Discussion
The dataset contains a total of 1,500 videos, with 50 videos per class. Specifically, the training set consists of 1,200 samples and the test set of 300 samples. This allocation guarantees a balanced and representative distribution for effective model training and assessment. The LRCN approach achieved 92.66 percent testing accuracy on the raw video dataset and 93.66 percent on the pose video dataset. It recognized a total of 277 out of 300 words on the raw videos; on the pose dataset, 282 out of 300 words were recognized. Table 3 summarizes the results for the two types of datasets.
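As a rough illustration, test accuracy can be computed from the model's predictions as follows. This is a minimal sketch assuming the `model`, `X_test`, and `y_test` objects from the earlier sketches, with one-hot labels; none of these names are defined in the paper itself.

```python
import numpy as np

# Hypothetical shapes: X_test (300, 30, 64, 64, 3), y_test one-hot (300, 30).
probs = model.predict(X_test)
pred_classes = probs.argmax(axis=1)
true_classes = y_test.argmax(axis=1)
n_correct = int((pred_classes == true_classes).sum())
print(f"{n_correct}/{len(true_classes)} words recognized "
      f"({n_correct / len(true_classes):.2%} test accuracy)")
```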
The model achieves both high precision and high recall, where high precision corresponds to a low false positive rate ($\text{precision} = \frac{TP}{TP + FP}$) and high recall to a low false negative rate ($\text{recall} = \frac{TP}{TP + FN}$). The pose dataset yields better class-wise precision and recall than the raw dataset, as shown in Table 4. Since every sample consists of only 30 frames, the LSTM proves to be a useful tool; for much longer sequences (more than 100 frames), its accuracy would likely not be as high, in which case a CNN-only regression model or a 3D CNN can be useful. Recurrent Neural Networks are also popular for video classification, but the LSTM was chosen for its ability to preserve frame-by-frame successive information. To gain further accuracy, the current model is well suited to an increased amount of raw data. Data augmentation, such as mirroring, rotating, speed moderation, and resolution mixing, can be used to increase the amount of data, as in the sketch below. Adding more convolutional layers is another viable option.
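As an illustration, the following minimal sketch applies two of the augmentations suggested above (mirroring and small rotations) to a whole clip; the probability and angle range are assumptions, not values from the paper.

```python
import cv2
import numpy as np

def augment_sequence(frames, rng=None):
    """Label-preserving augmentations for a clip of shape (T, H, W, C)."""
    rng = rng or np.random.default_rng()
    if rng.random() < 0.5:  # mirroring (horizontal flip)
        frames = np.ascontiguousarray(frames[:, :, ::-1, :])
    # Small random rotation applied consistently to every frame.
    angle = rng.uniform(-10.0, 10.0)
    h, w = frames.shape[1:3]
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    frames = np.stack([cv2.warpAffine(f, M, (w, h)) for f in frames])
    # Speed moderation could be simulated by re-sampling frame indices,
    # and resolution mixing by down- then up-scaling before the final resize.
    return frames
```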

Conclusion
This paper employs the Long-Term Recurrent Convolutional Network (LRCN) model to demonstrate real-time detection of American Sign Language words. The evaluation encompasses two distinct types of datasets, namely raw videos and pose videos, featuring 30 commonly used terms in daily life. Spatial features are extracted using CNN layers, while an LSTM is employed for temporal feature classification. The accuracy achieved is 92.66 percent on the raw dataset and 93.66 percent on the pose dataset. This study serves as a foundation for future research, aiming to propose a substantial dataset for Bangla Sign Language videos, addressing the current absence of such resources.

Figure 1. Landmarks on MediaPipe to illustrate sample data.
Figure 2. Raw and pose video frames of the "goodbye" sign, with the pose extracted using MediaPipe.

Figure 4. Accuracy and loss curves for the raw and pose datasets.



Table 1. The 30 words used in the proposed dataset.

Table 2. Model framework.

Table 3. Results of the raw and pose datasets.

Table 4. Precision and recall per class.

Table 5. Comparison with others' work.

Table 5 encapsulates a comparative analysis with prior works, highlighting the superior performance of the CNN-LSTM-based LRCN model on our proposed dataset.