Revolutionizing crowd surveillance through voice-driven face recognition empowering rapid identification: towards development of sustainable smart cities

Recent global efforts to create sustainable smart cities have significantly transformed society and improved people's lives. Crowd surveillance (CS) has become essential in sustainable smart cities and society to protect public safety and security. In this regard, face-based human detection has received considerable attention as an emerging method in crowd surveillance applications. Thus, in this work, a new method for the real-time identification of people in a crowd surveillance system (CSS), using facial and speech recognition technology, is introduced. Traditional CS systems frequently rely on human operators to watch and evaluate video feeds. Human error and operator fatigue may result in missed events or slow responses, which reduce the system's efficacy. Certain procedures, including the initial identification and monitoring of people in video feeds, can be automated using a voice-activated system. To address the issues with present CSS, a new framework, Voice-Activated Face Recognition (VAFR), is proposed in this work. The proposed framework combines speech and face recognition models for crowd surveillance. Experimental and simulation studies have been performed to analyze the performance of the proposed VAFR framework. The proposed framework uses the Viola-Jones algorithm for face identification and the Conformer architecture for speech analysis, reaching a noteworthy 99.8% accuracy rate in live video feeds. In addition, the ethical and safety aspects of the proposed VAFR system are presented.


Introduction
In today's increasingly congested environment, a crowd surveillance system (CSS) plays an essential role in enhancing public safety and security in the development of sustainable smart cities and society [1]. The necessity for efficient crowd monitoring and control has increased more than ever before with the rise of big events, crowded metropolitan areas, and transit hubs in smart cities [2]. Crowd surveillance is the systematic observation and analysis of large crowds in public places or at events using cutting-edge technological devices. Its main goals are to increase public safety, stop possible threats, and facilitate speedy action in emergency situations. A CSS makes use of a variety of surveillance technologies, including video cameras, sensors, and data analytics, to give real-time situational awareness, allow proactive security measures, and assist crowd-management decision-making processes [3].
The role of CSS in the development of sustainable smart cities and society is highlighted in figure 1. In smart cities and society, a multitude of applications depend entirely on reliable and effective CSS. Figure 1 illustrates a sustainable smart city scenario where a crowd or individual needs to be monitored in real-time to offer a variety of services, such as disaster management, population counting, event management, safety monitoring, military management, and suspicious activity detection in public areas [4]. Crowd surveillance has a wide range of uses across several industries. Surveillance systems support the smooth flow of pedestrian traffic, the detection and deterrence of criminal activity, and the identification of possible threats in public areas including city centers, stadiums, and entertainment venues [5]. Crowd surveillance is essential for regulating passenger flow, minimizing congestion, and spotting suspicious activity for increased security in transportation hubs including airports, railway stations, and bus terminals. Crowd surveillance is also used in other fields, such as urban planning, disaster monitoring, disaster risk management, and protest monitoring, where precise crowd analysis may optimize emergency response plans and guide resource allocation [6, 7].
Successful monitoring and analysis of crowds is achieved via the use of several tools and technologies in crowd surveillance. Many surveillance systems are built on closed-circuit television cameras [8], which collect visual data from various angles. Sophisticated machine learning methods [9] and video analytics algorithms [10] are then used to automatically detect abnormalities, follow people, and identify possible threats in real-time. Additionally, crowd dynamics and behavior may be better understood with the use of crowd-counting sensors, mobile device monitoring, and social media analytics. In [11], one of the early efforts on face detection and identification using a Raspberry Pi is demonstrated. To find and identify faces, the system made use of the Raspberry Pi, the Haar detection algorithm, and Principal Component Analysis (PCA). In [12], a real-time face recognition system for those with visual impairments is designed using the Haar feature-based cascade classifier and a Raspberry Pi. A camera module was used by the system to take pictures and identify faces instantly. The system's accuracy of 94.85% indicated the Raspberry Pi's potential for use in creating assistive technologies for people who are blind or visually impaired. In order to increase identification accuracy under different settings, Salama AbdELminaam et al [13] propose a facial recognition (FR) system that uses deep convolutional neural networks (DCNN) to extract face characteristics and transfer learning in fog and cloud computing. The suggested method outperformed the Decision Tree, K-Nearest Neighbour, and Support Vector Machine algorithms when evaluated on three datasets, with a highest accuracy of 99.06%. In [14], a real-time facial detection system utilizing the Eigenface algorithm and Raspberry Pi was presented. The system used the camera module on the Raspberry Pi, and it had a 99.63% accuracy rate. The system showed how the Raspberry Pi may be used to create facial identification systems for applications related to access control and security [14].
For many years, computer vision and pattern recognition researchers have been working on integrating face detection and identification algorithms. Security systems, access control, monitoring, and other related fields have all found use for this strategy. However, manual input or touch-based interfaces are frequently required by conventional techniques, which can be inconvenient in some scenarios [15]. Touchless interfaces, particularly those equipped with voice commands, enhance the flow of people in congested scenarios such as airports and train stations. They also contribute to maintaining sanitation during medical emergencies and in hospitals by minimizing physical contact. These features make them ideal for crowd surveillance in various contexts, including noisy environments and emergency situations. Besides boosting productivity and efficiency, these interfaces offer enhanced accessibility and safety by reducing the transmission of viruses and allowing users to keep their hands free for crucial activities. In the present work, this problem is addressed by a novel method that combines facial detection and identification with voice-activated instructions. This approach uses speech recognition technology that, with proper hardware, can give accurate outputs in challenging situations such as loud and noisy environments, making it easier to use and more effective. While lowering the likelihood of mistakes and misidentification, the suggested solution seeks to offer a smooth and safe authentication procedure.
The effectiveness of this strategy depends on how good and precise the facial recognition and detection algorithms are. The accuracy, speed, and complexity of numerous facial recognition algorithms were studied and contrasted in [16]. The analysis demonstrated that while the Eigenface technique is straightforward and effective, its accuracy is constrained. The Local Binary Patterns (LBP) algorithm offers more accuracy, although the Eigenface approach is faster and simpler. The most accurate method is the Convolutional Neural Network (CNN), although it is computationally costly and needs a lot of training data. The analysis made clear how crucial it is to choose the right algorithm for a certain application based on its needs for accuracy, speed, and complexity [16]. A deep sparse representation classifier was proposed in [17] for facial recognition and detection systems. The suggested solution used sparse representation to extract and categorize face characteristics, followed by a deep neural network for classification [17]. A thorough evaluation of the literature on the reliability of facial recognition algorithms was conducted in [18]. The analysis discovered that dataset size, illumination, and image quality all have an impact on how accurate face recognition systems are. For the facial recognition stage, Artificial Neural Networks (ANN) are found to be more prevalent, with a focus on CNN. The Viola-Jones method is most frequently used during the detection phase, while other phases are distinguished by clearly defined algorithms [18]. The face recognition technique used in the proposed system is based on the Viola and Jones algorithm. Since all human faces share some characteristics, these characteristics can be used as Haar features to identify faces in images.
Another essential element of this strategy is voice recognition technology. Based on each person's distinctive voice-print, voice recognition algorithms employ pattern recognition techniques to identify and authenticate them. It is more difficult for unauthorized people to access sensitive areas or information when voice authentication is used in conjunction with facial recognition. A thorough comparison of several automated speech recognition (ASR) methods is provided in [19]. The relevance of ASR in different applications, including healthcare, education, and security, is discussed, along with a complete overview of the various ASR methods, including support vector machines (SVMs), hidden Markov models (HMMs), and neural networks. The authors assess the effectiveness of these strategies in terms of processing speed and word error rate (WER) [19]. For directing the movement of a scout robot, a voice command detection system is proposed in [20]. The system was created to recognize a set of predefined spoken commands for specialized tasks including movement control, obstacle avoidance, and picture capturing [20]. The usage of Transformer- and CNN-based models for ASR is discussed in [21]. These models have shown promise in outperforming Recurrent Neural Networks (RNNs) and are used in the study. Transformers and CNNs work together to efficiently model both local and global dependencies in an audio sequence. With a WER of 2.1%/4.3% without a language model and 1.9%/3.9% with an external language model, the new architecture, dubbed Conformer, greatly outperforms earlier models and achieves state-of-the-art accuracies on the LibriSpeech benchmark. The Conformer model's higher accuracy with fewer parameters demonstrates the significance of the convolution modules in the model [21].
Additional advantages come from combining voice-activated instructions with facial detection and identification. The technology, for instance, can be employed in circumstances where users must keep their hands free or have restricted mobility. While monitoring an entrance, a security guard may need to keep their hands free to undertake other activities, or a surgeon may need to keep their hands sterile while performing a medical procedure. The voice-activated interface provides a useful and practical solution in these circumstances. To deliver a smooth user experience across several platforms, the voice-activated face detection and identification system may also be coupled with other devices including computers, tablets, and smartphones. For instance, a person might access their online accounts, confirm a payment, or unlock their smartphone using only their voice.
However, this strategy might have some drawbacks and raise certain issues. The accuracy and dependability of speech recognition in busy or loud circumstances is one of the major issues. The algorithm could also have trouble identifying people who have similar facial traits or who wear eyeglasses, hats, or other accessories that might cover their faces. Additionally, the gathering and storage of biometric data raises privacy issues that need to be handled by suitable data protection and privacy rules. Ethical concerns may arise, such as security risks and the potential abuse and misuse of data. When creating and implementing systems that combine face detection and identification with voice-activated commands, it is therefore crucial to take these ethical concerns into account and address them in order to ensure responsible and ethical use of the technology and to safeguard people's rights and interests.
Voice-activated instructions and facial detection and identification technologies, working together, might revolutionize how people interact with technology and boost security in a variety of settings. This strategy is set to become a more significant and pervasive aspect of everyday life as deep learning algorithms and speech recognition technologies continue to progress. To resolve the possible drawbacks and issues and guarantee the system's dependability and accuracy in a variety of settings and circumstances, further research and development are required.
The main objective of the present work is to develop a thorough framework for crowd surveillance for sustainable smart cities and society that combines face recognition technology with voice-activated commands.By utilizing the synergies between these two modalities, the framework seeks to improve the effectiveness and efficiency of CSS while offering a reliable and intelligent solution for crowd management in a variety of contexts, including public safety, event planning, and security operations in smart cities and society.
The contributions of the study are as follows: • Developed a Voice-activated Face Recognition (VAFR) framework for smart cities and society to provide a reliable and precise crowd surveillance system (CSS) that combines face detection and recognition technology with voice-activated commands, capable of accurately detecting and recognizing faces under a variety of lighting scenarios, viewing angles, and expressions.
• Developed the hardware and software interface to validate the performance of the proposed framework using simulation and experimental study.
• Performed comparison of the proposed framework with other state-of-the-art frameworks.
• Analyzed the ethical and safety aspects of the proposed VAFR system.
The rest of the paper is organized as follows. Section 2 presents the proposed VAFR system architecture. The voice recognition model of the proposed framework is presented in section 3. The face recognition model is described in section 4. Section 5 provides the results and discussion, with a comprehensive comparison against other state-of-the-art frameworks and an analysis of ethical and safety aspects. The paper is concluded in section 6.

Proposed VAFR system architecture
The proposed VAFR framework, shown in figure 2, combines a camera and a microphone for facial recognition with voice commands for crowd monitoring. The system first records live video of the crowd using the camera, while voice commands are captured using the microphone. The system works in the following way:

Voice command audio input
An audio input comprising a spoken command from the user is sent to the system.

Speech recognition
A pre-processing step is performed on the audio input, which may involve noise reduction, normalization, and feature extraction. The pre-processed audio is then fed into a Conformer model for voice recognition. The Conformer is a state-of-the-art sequence-to-sequence model for voice recognition that combines CNNs and transformers. When an individual is named via a spoken command, the voice recognition module converts the speech to text and identifies the name. The same is elaborated in figure 2.

Command evaluation
The recognized text output from the Conformer model is examined to establish whether it contains a face recognition command. This analysis may be conducted based on predetermined terms or phrases connected to the facial recognition dataset. If the system finds a valid facial recognition command, it moves on to the following stage; if not, it asks the user to enter a legitimate command.
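As a minimal sketch of this stage, the recognized transcript can be scanned for a predefined trigger word and an enrolled name. The trigger words, names, and transcript format below are illustrative assumptions, not the system's actual vocabulary:

```python
# Hypothetical trigger vocabulary and enrolled names (illustrative only).
ENROLLED_NAMES = {"alice", "bob", "charlie"}
TRIGGER_WORDS = {"find", "locate", "identify"}

def parse_command(transcript: str):
    """Return the enrolled name to search for, or None if the command is invalid."""
    words = transcript.lower().split()
    if not any(w in TRIGGER_WORDS for w in words):
        return None  # no recognized face-recognition command
    for w in words:
        if w in ENROLLED_NAMES:
            return w
    return None  # valid trigger but no enrolled name mentioned

print(parse_command("Find Alice in the crowd"))  # alice
print(parse_command("What time is it"))          # None
```

In a deployed system, this exact-keyword match would likely be replaced by fuzzy matching to tolerate transcription errors.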

Face detection and recognition
For face detection, the system employs the Viola-Jones algorithm. To find faces in images, the system uses a series of classifiers trained on Haar-like features. The Viola-Jones method is used to analyze each video frame to find and extract probable face areas. The algorithm is trained with three images per person from the dataset, which include a front view, a side view, and an image in bad lighting conditions, to make the algorithm as accurate as possible. The face recognition model is then run on the retrieved facial areas. The extracted faces are compared by the face recognition algorithm to pre-enrolled face templates kept in a database. If a match is found, the system can offer pertinent facts about the identified face, such as the person's identity or other information. The proposed framework displays the name of the person along with their distance from the camera. The process and outcomes are demonstrated in section 5.3.

Output
Based on the instruction and the outcomes of the facial recognition, the system offers the appropriate feedback or response. In our case, the system shows the information about the person who was recognized and the approximate distance of the person from the camera, which is calculated by the simple triangulation formula: distance from the camera to the face = (real-world width of the face (in cm) × camera focal length (in pixels)) / width of the face in the image (in pixels). The framework employs the known dimensions of the identified face along with this formula to calculate the distance [22].
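The triangulation formula above translates directly into code. The 14 cm face width and 700 px focal length below are illustrative values for the sketch, not calibrated parameters from the framework:

```python
def distance_to_face_cm(real_face_width_cm, focal_length_px, face_width_px):
    """Pinhole-camera triangulation: distance = (W_real * f) / W_pixels."""
    return real_face_width_cm * focal_length_px / face_width_px

# Example: a 14 cm wide face imaged at 140 px with a 700 px focal length
# sits roughly 70 cm from the camera.
print(distance_to_face_cm(14.0, 700.0, 140.0))  # 70.0
```

In practice the focal length in pixels is obtained once by camera calibration (imaging an object of known width at a known distance).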

Conformer architecture
In this study, the Conformer, a cutting-edge model architecture [21, 23], is utilized for designing the voice recognition model. CNNs and transformers [24], two potent machine learning frameworks, are combined in a novel way in this design. The Conformer model offers a special solution for voice recognition problems that demand both local and global contextual awareness by combining these complementary approaches [25].
The Conformer model is selected because it can express long-range dependencies like transformers [24] and local dependencies like CNNs while still being parameter-efficient. In light of the inherent temporal character of audio data [25], this combination can significantly boost performance in voice recognition tasks.
The Conformer design, its implementation, and its effectiveness in a range of voice recognition tasks are all thoroughly examined in this work. The Conformer, a convolution-augmented speech recognition transformer, combines convolutional neural networks and transformers to model both the local and global dependencies of an audio sequence in a parameter-efficient manner, achieving the best of both worlds. Here is a thorough description of the Conformer architecture:

Convolutional subsampling layer
A convolutional subsampling layer is used to first process the model's raw input, which for a speech recognition task would be an audio waveform. Convolutional operations are used in this layer to down-sample the input sequence's temporal dimension, making it simpler for the following layers to process. The layer's output provides the input to the Conformer blocks.
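As an illustrative sketch (not the system's implementation), a stride-2 1-D convolution halves the number of feature frames, and stacking two such layers yields the roughly four-fold frame-rate reduction commonly applied before the Conformer blocks. The kernel values here are placeholders:

```python
import numpy as np

def conv1d_stride2(x, kernel):
    """Valid 1-D convolution with stride 2: roughly halves the temporal dimension."""
    k = len(kernel)
    out_len = (len(x) - k) // 2 + 1
    return np.array([np.dot(x[2 * i:2 * i + k], kernel) for i in range(out_len)])

# Two stacked stride-2 layers give a ~4x frame-rate reduction
# before the Conformer blocks (kernel weights are arbitrary placeholders).
frames = np.arange(100, dtype=float)      # 100 input feature frames
k = np.array([0.25, 0.5, 0.25])
h = conv1d_stride2(frames, k)
out = conv1d_stride2(h, k)
print(len(frames), "->", len(h), "->", len(out))  # 100 -> 49 -> 24
```

Real implementations operate on 2-D spectrogram features with learned multi-channel kernels, but the temporal down-sampling behaves the same way.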

Conformer blocks
The cornerstones of the Conformer model are the Conformer blocks. Four primary modules are layered consecutively within a Conformer block:
• A feed-forward network, which processes the input sequence independently at each time-step, is the first part of a Conformer block. This is accomplished via a combination of linear transformations and non-linear activations. This module's output serves as the next module's input.
• The Multi-Head Self-Attention module, taken from the Transformer design, enables the model to weigh different components of the input sequence according to how important they are to the current output prediction. In other words, it enables the model to 'pay attention' to the most crucial portions of the input when generating each element of the output.

Convolution module
This module adds to the Conformer block a convolution layer that can detect local dependencies in the input sequence. Similar to CNNs, this convolution layer works by applying a series of filters to the input sequence to extract regional features and patterns. This enables the model to recognize and interpret localized patterns in the input that the self-attention mechanism might otherwise overlook.

Second feed-forward module
Similar to the first feed-forward module, the second feed-forward module processes the sequence after the convolution module.
Within each Conformer block, these modules are applied sequentially, and the model may have numerous such blocks stacked on top of one another. A final linear layer that produces the model's output, such as a sequence of predicted words for a voice recognition task, usually comes after the blocks in the Conformer model.

Residual connections and layer normalization
The Conformer model features residual connections and layer normalization to simplify training and avoid the vanishing or exploding gradients problem. Each module in the Conformer block (the feed-forward networks, self-attention, and convolution) is wrapped in a residual connection and accompanied by layer normalization. These aid in controlling the information flow through the network and keeping activations well-scaled.

Macaron style
Unlike the original Transformer, which begins with a self-attention module and ends with a feed-forward module, the Conformer model employs a macaron style, which effectively implies that each block starts and finishes with a feed-forward module.
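The macaron-style ordering and the residual wrapping described above can be sketched with stand-in modules. The module bodies below (a tanh feed-forward, a uniform-average "attention", and a moving-average "convolution") are toy placeholders chosen only to show the data flow through a block, not trained layers:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal((10, d))   # 10 time-steps, model dimension 8

def ffn(x):        # stand-in feed-forward module
    return np.tanh(x)

def attn(x):       # stand-in for multi-head self-attention (global averaging)
    return np.tile(x.mean(axis=0), (x.shape[0], 1))

def conv(x):       # stand-in depthwise convolution (3-tap moving average)
    p = np.pad(x, ((1, 1), (0, 0)), mode="edge")
    return (p[:-2] + p[1:-1] + p[2:]) / 3.0

def layer_norm(x):
    m, s = x.mean(-1, keepdims=True), x.std(-1, keepdims=True)
    return (x - m) / (s + 1e-6)

def conformer_block(x):
    # Macaron order: half-step FFN, attention, convolution, half-step FFN,
    # each wrapped in a residual connection, with a final layer norm.
    x = x + 0.5 * ffn(x)
    x = x + attn(x)
    x = x + conv(x)
    x = x + 0.5 * ffn(x)
    return layer_norm(x)

y = conformer_block(x)
print(y.shape)  # (10, 8)
```

The 0.5 scaling on the two feed-forward residuals is the half-step weighting that gives the macaron structure its name.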
The main advantage of the Conformer architecture is the efficient coupling of local feature extraction (through the convolutional layer) with global context modeling (via the self-attention layer).Because it can comprehend both the local and overall context of an audio sequence, it is particularly effective for applications like automatic speech recognition.
The Conformer performs noticeably better than the older Transformer- and CNN-based models, attaining cutting-edge accuracy [26]. The model achieves a WER of 2.1%/4.3% on the popular LibriSpeech benchmark [27] test/test-other sets without the use of a language model, and 1.9%/3.9% when using an external language model. The study in [23] illustrates the average WER for the Conformer-1 model and other popular models across 5 internal benchmarks.
The audio encoder initially applies a convolutional sub-sampling layer to the input before applying several Conformer blocks. The model's usage of Conformer blocks rather than Transformer blocks distinguishes it from earlier architectures. A Conformer block is made up of four modules layered on top of one another: a feed-forward module, a self-attention module, a convolution module, and a second feed-forward module at the very end [21].

Implementation of conformer architecture for VAFR
The Conformer is employed in the proposed model to process the user's spoken requests [21]. When a name is recognized, it is matched to a person in the datasets in order to start the face recognition phase of the pipeline. Here is a detailed explanation of how the Conformer architecture can be used in this situation:
Step 1: Data pre-processing
It is necessary to pre-process the audio data (speech instructions) into a suitable format before the Conformer can be utilized for speech recognition. This often entails transforming the raw audio into a series of feature vectors, such as log-Mel spectrogram features or Mel-frequency cepstral coefficients (MFCCs). These feature vectors are then fed to the Conformer model as input.
Step 2: Model training
A sizable corpus of transcribed speech data is used to train the Conformer model, with a text transcription corresponding to each audio example. During training, the model learns to map the sequence of audio feature vectors to the sequence of words in the transcription.
Step 3: Model inference
Once trained, the Conformer model can be employed to transcribe new speech instructions. Whenever a user issues a command (speaks a name), the accompanying audio is pre-processed into a series of feature vectors and given to the model. The model generates a string of words that, in theory, should correspond to the name as it is spoken.
Step 4: Integration with face recognition
The facial recognition system is activated using the name transcribed by the Conformer model. The face recognition software begins looking for a match in the live camera feed when it recognizes a name. The name is overlaid on the video stream if a match is discovered. In the absence of a match, the system keeps looking for faces.

Step 5: Real-time operation
The Conformer model processes audio input continuously in real-time for spoken commands, and the face recognition system continuously searches the video input for faces. The program runs in a loop, searching continuously for spoken names and matching faces.
The stages above give a clear explanation of how to incorporate the Conformer architecture into the current system. More specific factors, including model choice, data collection and annotation, system latency, processing resources, and potential privacy issues, must be taken into account for the actual implementation.

Design of face recognition model

About the Viola-Jones algorithm
The Viola-Jones algorithm, a significant technique that, upon its release in 2001, completely changed the area of face detection, is the focus of this section. This effective and potent algorithm, created by Paul Viola and Michael Jones, offers a cutting-edge method for real-time facial detection, a crucial part of our surveillance system [28].
The Viola-Jones algorithm is the subject of our attention because it has a strong track record of being relevant in the industry [29].In particular, its real-time facial detection capacity makes it the perfect choice for systems needing quick identification, like the CSS covered in our study [30].
The Viola-Jones algorithm's implementation, effectiveness in face detection tasks, and potential for further development are all examined in detail throughout this research.We contrast it with other well-known models to highlight its special benefits and find areas where its functioning might be further improved [18].
The real-time face detection capability of the Viola-Jones method makes it particularly useful for systems that need quick identification. Due to its versatility and low computing cost, the technique, which was initially created for effective face detection, now has applications in numerous disciplines.
The Viola-Jones approach makes use of the integral image concept to compute features quickly, and a cascading classifier to quickly eliminate non-face regions, concentrating computing resources on areas with the best chances of success. The following list outlines the several stages of the algorithm:

Haar feature selection
The calculation of several features for each image during the Haar feature selection stage produces a sizable pool of data for each face in the dataset. The integral image concept makes it possible to compute these features quickly.
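A minimal sketch of the integral-image trick: once the summed-area table is built, the sum of any rectangle, and hence any Haar-like feature, costs only four lookups. The 4x4 image and the two-rectangle feature below are toy examples:

```python
import numpy as np

def integral_image(img):
    """Summed-area table: entry (y, x) holds the sum of img[:y+1, :x+1]."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, h, w):
    """Sum of any h-by-w rectangle in four lookups (zero-padded for edges)."""
    p = np.pad(ii, ((1, 0), (1, 0)))
    return (p[top + h, left + w] - p[top, left + w]
            - p[top + h, left] + p[top, left])

img = np.arange(16.0).reshape(4, 4)
ii = integral_image(img)
# Two-rectangle Haar-like feature: left half minus right half of a 2x4 window.
feature = rect_sum(ii, 0, 0, 2, 2) - rect_sum(ii, 0, 2, 2, 2)
print(feature)  # 10.0 - 18.0 = -8.0
```

Because each rectangle sum is constant-time regardless of its size, thousands of Haar-like features can be evaluated per window at video rate.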

Using AdaBoost to build an effective classifier
Since it is impractical to use all calculated features, the AdaBoost method is used to pick the most pertinent ones, resulting in the construction of a strong classifier from a pool of weak ones.
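One AdaBoost round can be sketched as choosing the decision stump (threshold and polarity over a single feature) with the lowest weighted error. The feature values, labels, and weights below are toy data, not trained values:

```python
# Minimal sketch of weak-learner selection in one AdaBoost round.
def best_stump(features, labels, weights):
    """Return the (threshold, polarity, weighted_error) minimizing the
    weighted error of the stump sign(polarity * (feature >= threshold))."""
    best = (None, 1, float("inf"))
    for thr in sorted(set(features)):
        for pol in (1, -1):
            err = sum(w for f, y, w in zip(features, labels, weights)
                      if (pol * (1 if f >= thr else -1)) != y)
            if err < best[2]:
                best = (thr, pol, err)
    return best

feats = [0.2, 0.4, 0.6, 0.9]   # one Haar-like feature value per window
labels = [-1, -1, 1, 1]        # face (+1) vs non-face (-1)
weights = [0.25] * 4           # uniform weights in the first round
thr, pol, err = best_stump(feats, labels, weights)
print(thr, pol, err)  # 0.6 1 0.0
```

In subsequent rounds, the sample weights are increased on misclassified windows, so later stumps focus on the hard cases; the final strong classifier is a weighted vote of the selected stumps.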

Cascade classifier construction
Building a cascade structure of classifiers during the final step enables the algorithm to swiftly exclude non-face regions and focus more computation on areas that are likely to contain faces. The algorithm's power comes from its exceptional blend of speed, effectiveness, and adaptability. It is the best contender for the rapid face recognition requirements of our surveillance system because it outperforms more established face recognition models and achieves cutting-edge precision. In this study, a thorough analysis of the Viola-Jones algorithm's effectiveness and performance on several benchmarks is presented.
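The early-rejection behavior of the cascade can be sketched as a chain of increasingly strict stage classifiers. The scalar "face score" and thresholds below are illustrative stand-ins for real boosted stage classifiers:

```python
# Hypothetical cascade evaluation: a window is rejected at the first failing
# stage, so most non-face windows cost only one or two stage evaluations.
def cascade_classify(window, stages):
    for stage in stages:
        if not stage(window):
            return False     # early rejection: stop computing immediately
    return True              # survived every stage -> candidate face

# Toy stages of increasing strictness over a scalar "face score".
stages = [lambda w: w > 0.1, lambda w: w > 0.5, lambda w: w > 0.9]
print(cascade_classify(0.95, stages))  # True
print(cascade_classify(0.3, stages))   # False (rejected at stage 2)
```

The cost asymmetry is the point: almost all scanned windows contain no face, so cheap early stages carry nearly the whole workload.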
Rapid facial recognition has advanced significantly with the Viola-Jones algorithm's incorporation into a real-time surveillance system.This improvement is invaluable for applications that call for instantaneous identification of people in crowd surveillance settings.
For a number of reasons, the Viola-Jones algorithm is frequently utilized in the field of face detection. It has several benefits over other algorithms [29, 31], including the following:
• Speed: The Viola-Jones algorithm can run in real-time and is incredibly quick. It uses the integral image to compute features, which significantly cuts down on computing time.
• Scalability: The algorithm is flexible in its application and can recognize faces of different shapes and sizes inside a picture.
• Robustness: The Viola-Jones method is remarkably robust and, despite its simplicity, can handle a wide range of face angles, expressions, and lighting situations.
• Effective Feature Selection: By focusing on the most important features and ignoring the unimportant ones, the AdaBoost learning algorithm is able to increase the effectiveness and accuracy of detection.
• Ability to Reject Non-Faces Rapidly: The Viola-Jones algorithm has a cascade structure of classifiers that enables it to quickly reject non-face regions while conserving processing resources.
• Low Memory and Computational Requirements: The approach is quite efficient in terms of memory and computation, making it appropriate for devices with low resources.This is due to the integral image representation and cascade structure.
In the framework of our real-time CSS, we investigate various facial recognition algorithms in this study to determine their individual strengths, limitations, and best-use scenarios. The groundbreaking Viola-Jones algorithm, which has received high praise for its speed, effectiveness, and resistance to changes in lighting and facial expressions, is a major focus of this study. The Viola-Jones algorithm continues to hold a distinct position despite the emergence of more advanced techniques like deep learning based models and the existence of older, established algorithms like Eigenfaces, Local Binary Patterns Histograms (LBPH), Fisherfaces, and key-point extraction techniques like the Scale-Invariant Feature Transform (SIFT) and Speeded Up Robust Features (SURF).
We intend to show that the Viola-Jones method is still a powerful tool for the real-time detection and recognition of faces in CSS, differentiating itself from other approaches through a set of specific benefits, as we delve into this thorough analysis.

Implementation of VIOLA-JONES algorithm for VAFR
The Viola-Jones algorithm's implementation, shown in figure 3, serves as a crucial component for real-time facial recognition in the proposed surveillance system. It can be integrated as follows:
Step 1: Training phase
In this phase, a dataset made up of several photos of humans, collected at various angles and in varied lighting situations, is used [32]. The Viola-Jones face detection model is trained using these photos. In the training phase, the system chooses a few key visual features from a broader pool of candidate features using the AdaBoost classifier. Face detection in the image is then performed using a strong classifier created from these chosen features.
Step 2: Feature extraction. In real-time operation, the system starts the image processing pipeline when a user says a name that the voice recognition system recognizes. The trained Viola-Jones algorithm forms the first step in this pipeline, extracting Haar-like features from the webcam image [26]. This feature extraction makes it easier to tell faces apart from other parts of the image.
Step 3: Detection phase. The Viola-Jones method then determines whether a face is present in the image using the strong classifier created during the training phase [33]. If a face is found, a region of interest (ROI) is marked in the image at the location of the face.
Step 4: Identification phase. After detecting a face, the model compares its features to those in the current dataset in an effort to identify it. If the algorithm determines that the face belongs to the person named by the user, it superimposes the person's name on the webcam stream where the face was found.
Step 5: Real-time operation. This procedure runs in real time while the user issues commands and the camera stream is searched for faces. The algorithm adds the person's name to the camera stream when it recognizes the right face [29, 31]. If no face is found, the system keeps scanning until one is discovered.
In conclusion, the suggested system's face detection and identification component relies heavily on the Viola-Jones method. Its fast and effective real-time face detection capability makes it a solid and trustworthy choice for this application [33].

Database
The datasets consist of images of 20 people. Each person provided two images taken from different angles, specifically the front and either the left or right side of the face. The datasets also include each individual's images taken from different angles under bad lighting conditions, so as to improve the accuracy of the system. Table 1 shows the three datasets that have been used to test the face recognition model.
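A minimal sketch of enumerating such a dataset is given below, assuming one folder per person; this root/&lt;person&gt;/&lt;image&gt; layout is an assumed convention, since the paper does not document its exact on-disk structure.

```python
import os

def index_dataset(root):
    """Assign each person's folder an integer label and list the image
    paths under it, returning (name -> label, [(path, label), ...]).
    The root/<person>/<image> layout is an assumed convention."""
    names = sorted(d for d in os.listdir(root)
                   if os.path.isdir(os.path.join(root, d)))
    label_of = {name: i for i, name in enumerate(names)}
    images = [(os.path.join(root, name, f), label_of[name])
              for name in names
              for f in sorted(os.listdir(os.path.join(root, name)))]
    return label_of, images
```

The resulting (path, label) pairs can then be read with any image library and fed to the recognition stage, with the label map used to print the spoken name over the matched face.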

Hardware and software tools
Tables 2 and 3 show the hardware and software tools, respectively, used to realize the system.

Performance of proposed voice activated face recognition (VAFR)
The VAFR framework, proposed in section 2, is demonstrated and validated in the current section. The HP w100 digital webcam, which offers 480p resolution, was employed to record and capture images of a person's face. For audio input, such as a spoken command, a microphone is needed; the built-in microphone of the HP w100 webcam served as input for the Conformer architecture, which analyzes the audio to understand the instruction and carry it out. A Raspberry Pi 4B, a popular, inexpensive, credit-card-sized computer, runs the facial recognition software. The OpenCV-Python library, widely used for computer vision applications, is installed on the Raspberry Pi to find faces in the live video stream, and a VNC viewer offers a user-friendly interface for connecting to and controlling the Raspberry Pi's graphical desktop remotely from another computer. Using the facial recognition algorithm, the individual is identified from the continuous video stream of the camera. The experimental study has been conducted to demonstrate the performance of the proposed VAFR framework, and the results are depicted in figures 4 to 7.
Figures 4 and 5 show examples of successful face detection using bounding boxes, with names displayed as per the voice command and distance measurements (in centimeters). When multiple individuals are present in the picture, it is noteworthy that only the faces on which the Viola-Jones algorithm was trained are recognized, as shown in figures 4 to 7. Figures 6 and 7 demonstrate that the framework remains accurate in detection even in bad lighting. Overall, this method shows how successfully face recognition technology has been incorporated into the proposed system, detecting people in real-time video feeds based on voice commands. The accuracy of the VAFR framework is given in table 4. In the presented experiments, the Viola-Jones method, well known for its real-time face identification capability, showed remarkable accuracy: when used in the VAFR framework, it recognized faces from the video feed with a 99.4% accuracy rate. The combined VAFR framework not only makes use of the advantages of the Conformer and Viola-Jones models but also creates a synergy that improves accuracy over the individual components. This integrated approach highlights the significance of multimodal recognition systems in improving security and operational effectiveness in real-time surveillance applications.

Comparison analysis
Table 5 presents the comparison of various face recognition systems with the proposed VAFR framework. The Viola-Jones [30] technique has a 93.9% accuracy rate and is well known for its ability to distinguish faces from other generic objects. With a slightly better accuracy of 94.85%, Islam et al concentrated on creating assistive technology for the visually impaired [12]. With a 99.06% accuracy rate, the study of Salama AbdELminaam et al [13] uses a DCNN for a variety of purposes, suggesting its usefulness in forensic and security applications. The proposed VAFR framework uses both the Viola-Jones algorithm for face detection and the Conformer architecture for voice recognition, in contrast to Ahmed et al [14], who concentrated only on facial recognition using the Eigenface algorithm for security and access management with a claimed accuracy of 99.63%. In real-time surveillance applications, the proposed dual-modality method improves accuracy and user engagement by enabling a more flexible and nuanced system. Voice commands are used to start the face recognition algorithm, giving users a simple, hands-free method of interacting with the surveillance system. Speech recognition is handled by the Conformer architecture, well known for modeling local and global dependencies in audio data; this contrasts with Ahmed et al [14], who focused primarily on the visual aspect of recognition. The proposed framework reaches a 99.8% accuracy rate in real-time video streams, a slight but significant improvement that is especially useful in busy or dynamic surveillance settings. Even though adding a new modality increases the complexity of the surveillance system and may use more energy, the accuracy gains are significant for two reasons. Firstly, the dual-modality strategy reduces the possibility of false positives, which is important in surveillance systems because mistakes can have a large financial impact. Secondly, by utilizing effective algorithms and processing methods, the proposed framework optimizes energy usage; for example, the computational efficiency of the Conformer architecture offsets the higher energy cost of speech recognition. The VAFR framework's addition of voice recognition to augment face identification in crowd surveillance systems provides an innovative and significant improvement over the cited work. Thanks to rigorous algorithm selection and system design considerations, the somewhat higher accuracy is achieved without excessive energy expenditure. Although the proposed system achieved the highest accuracy, it is crucial to take other elements, such as the datasets, system setup, and experimental design, into account in order to evaluate these systems' performance completely.

Discussion
Using face and speech recognition techniques, the suggested strategy has demonstrated promise for real-time identification of persons. The Viola-Jones technique performs well for face detection, while the Conformer architecture performs well for speech recognition. This research shows that combining speech and face recognition leverages the advantages of both techniques to improve accuracy in a variety of scenarios, including low light, distance, partial face coverage, and low contrast, as can be seen in figures 6 and 7. The integration works well because voice recognition acts as a first filter, guiding and narrowing the focus of the facial recognition process; this is particularly helpful when visual confirmation alone may not be accurate. Because of this collaborative effect, the two processes work together to increase the overall robustness and dependability of the crowd monitoring system rather than operating independently. The approach may have shortcomings: a different person whose facial traits are extremely similar to a face in the dataset may be recognized instead, owing to factors such as lighting, camera angles, and facial expressions. Speech recognition can also be affected by background noise, speech tempo, and accents, which may influence the algorithm's accuracy. Using state-of-the-art hardware and algorithms, enlarging the datasets, and resolving difficulties with lighting and background noise will all help make the system more accurate and reliable.
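The voice-as-first-filter flow described above can be sketched as follows, with `detect_faces` and `identify` as hypothetical callables standing in for the Viola-Jones and recognition stages.

```python
def surveillance_step(spoken_name, frame, detect_faces, identify):
    """Run the face pipeline only for the person named in the voice
    command. `detect_faces(frame)` yields candidate bounding boxes and
    `identify(frame, box)` returns a name; both are hypothetical
    callables standing in for the Viola-Jones and recognition stages.

    Returns the bounding box of the named person, or None."""
    for box in detect_faces(frame):
        if identify(frame, box) == spoken_name:
            return box  # match found: overlay the name at this box
    return None  # no match in this frame: keep scanning later frames
```

Gating the per-face identification on the spoken name is what keeps the expensive comparison step focused, rather than matching every detected face against the whole dataset on every frame.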

Ethical and safety aspects of VAFR
Biometric data collection and storage present privacy concerns that must be addressed by appropriate data protection and privacy laws. There may be ethical issues as well, such as security problems and the possibility of data exploitation and misuse. It is concerning that the use of facial recognition technology for surveillance can result in privacy invasion and human rights violations. Before deploying a face recognition system, its implications for privacy and ethics must therefore be carefully assessed. To ensure responsible and ethical use of the technology, and to protect people's rights and interests, it is crucial to take these ethical concerns into account when developing and implementing systems that combine face detection and identification with voice-activated commands, and to ensure that the system abides by privacy laws and regulations.

Conclusion and future scope
For the development of sustainable smart cities and society, the safety and security of people is important. This study introduces a Voice-Activated Face Recognition (VAFR) framework that integrates face and speech recognition technologies for real-time identification of people, with a special application in crowd monitoring toward the development of sustainable smart cities and society. The proposed VAFR framework uses the Conformer architecture, a type of speech recognition technology, for audio analysis. For face detection, the Viola-Jones algorithm is used, with positive outcomes: in real-time video streams, the algorithm can identify people with 99.8% accuracy. The framework, however, has limits in situations where similar facial characteristics, lighting, camera angles, or facial expressions may result in misidentification. Accents, speech pace, and background noise may all influence speech recognition, affecting the accuracy of the algorithms. Future work might concentrate on expanding the datasets, solving issues with illumination and background noise, and utilizing cutting-edge hardware and algorithms to increase accuracy and dependability.
This technique has a lot of promise for use in crowd surveillance. It can improve security measures, make crowd control easier, and help identify people in real-time video feeds. However, it is essential to consider the privacy and ethical implications of facial recognition technology. It is crucial to strike a balance between security requirements and privacy concerns, ensuring that the system is implemented in accordance with privacy laws and regulations. This study demonstrates how facial and speech recognition technologies may be successfully combined for crowd monitoring. With further development and the system's high accuracy rate, security and monitoring in busy areas could be greatly improved, creating safer public spaces.

Figure 1 .
Figure 1. The role of CSS in the development of sustainable smart cities and society.

Figure 3 .
Figure 3. Flowchart of working of Viola-Jones algorithm.

Figure 4 .
Figure 4. Dataset 1 detection with name and distance from the camera.

Figure 5 .
Figure 5. Dataset 2 detection with name and distance from the camera.

Figure 6 .
Figure 6. Detection in bad lighting conditions.

Figure 7 .
Figure 7. Detection of alternative dataset in bad lighting conditions.

Table 1 .
Datasets with different poses and bad lighting.

Table 5 .
Comparison of various face recognition systems with the proposed system.
The face recognition model proved robust to different lighting conditions and viewing angles. The Conformer model, created for effective voice recognition, processed speech instructions with 100% accuracy. This performance highlights the model's capability to reliably convert spoken instructions into useful inputs in spite of fluctuating speech patterns and background noise.