Siamese Networks for Speaker Identification on Resource-Constrained Platforms

This paper investigates the implementation of a lightweight Siamese neural network for improving speaker identification accuracy and inference speed on embedded systems. Integrating speaker identification into embedded systems can improve portability and versatility. Siamese neural networks perform speaker identification by comparing input voice samples to reference voices in a database, effectively extracting features and classifying speakers accurately. Considering the trade-off between accuracy and complexity, as well as the hardware constraints of embedded systems, various neural networks could be applied to speaker identification. This paper compares CNN architectures targeted at embedded systems, namely MCUNet, SqueezeNet, and MobileNetv2, as subnetworks for implementing Siamese neural networks on a Raspberry Pi. Our experiments demonstrate that MCUNet achieves 85% accuracy with a 0.23-second inference time, whereas the larger MobileNetv2 attains 84.5% accuracy with a 0.32-second inference time. Additionally, contrastive loss outperformed binary cross-entropy loss in the Siamese neural network: the system using contrastive loss achieved almost 68% lower loss scores, resulting in more stable performance and more accurate predictions. In conclusion, this paper establishes that an appropriate lightweight Siamese neural network, combined with contrastive loss, can significantly improve speaker identification accuracy and enable efficient deployment on resource-constrained platforms.


Introduction
Speaker identification is often used in biometric security systems to determine a person's identity. It works by comparing the unknown speaker's audio to the models of all enrolled speakers; the best-matching speaker is taken as the most likely identity of the person speaking. Speaker identification differs from speaker verification, which only checks whether the speaker's identity matches a claimed identity. Speaker identification requires N comparisons to identify a speaker from a group of N people, while speaker verification requires only one comparison. The key steps in speaker identification are: (1) extracting features from the audio, (2) comparing those features to the speaker models in a database of known speakers, and (3) making a decision based on the comparison.
Successful speaker identification largely depends on the model used to represent the speaker's voice. This model can be created using various techniques such as Gaussian mixture models (GMM), hidden Markov models (HMM), support vector machines (SVM), or deep neural networks (DNN). Recently, DNNs have become popular due to their strong performance.
In a typical DNN structure, the model learns to classify data from a provided dataset and generates a prediction probability; achieving high prediction probabilities in this way, however, generally requires large amounts of training data. CNN-based Siamese neural networks (SNNs), shown in Figure 1, are particularly well suited to speaker identification tasks because they specialize in learning the similarity or dissimilarity between two input samples [1]. For speaker identification, an SNN compares the features of an unknown speech signal against reference features from a speech database and determines the best match by comparing similarity scores. Compared to other methods, Siamese networks require fewer training samples while maintaining high accuracy. They are also more robust to variations in speech signals and can handle large speech datasets.
The speaker identification process involves comparing an unknown speaker's utterance to a database of enrolled speakers. If the similarity between the utterance and a database entry exceeds a certain threshold, the identity claim is accepted; otherwise, it is rejected. The accuracy of speaker identification depends on choosing an appropriate threshold value: a threshold that is too low can lead to inaccurate identification, while one that is too high can make identification difficult [2].
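As an illustration, the decision step can be sketched as follows. The names embed, enrolled_db, and THRESHOLD are hypothetical, and Euclidean distance is assumed as the comparison measure (consistent with the contrastive loss used later); this is a minimal sketch rather than the exact implementation used in this work.

```python
import numpy as np

THRESHOLD = 0.5  # assumed value; in practice tuned on held-out data

def identify(embed, utterance, enrolled_db):
    """Identify an unknown utterance against every enrolled speaker.

    embed       -- function mapping an utterance to an embedding vector
    enrolled_db -- dict mapping speaker name -> reference embedding
    Returns the best-matching speaker, or None if no match clears the threshold.
    """
    query = embed(utterance)
    # One comparison per enrolled speaker (N comparisons for N speakers)
    distances = {name: np.linalg.norm(query - ref) for name, ref in enrolled_db.items()}
    best_name, best_dist = min(distances.items(), key=lambda kv: kv[1])
    # Accept the best match only if it is close enough to the reference
    return best_name if best_dist <= THRESHOLD else None
```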
Feature extraction is essential for preprocessing speaker identification data. It reduces dimensionality by dividing the raw data into smaller, more manageable groups. This aids training by extracting critical information from speech waves while reducing model complexity [3,4]. The extracted features are then input to neural networks for model training.
This paper describes an embedded implementation of speaker identification using Siamese neural networks, deployed on the Raspberry Pi 4 Model B. Due to the Pi's resource constraints, identifying suitable SNN subnetworks is crucial to ensure acceptable real-time execution. In addition to well-known lightweight networks such as MobileNetv2 and SqueezeNet, MCUNet, which is built using neural architecture search (NAS), is also used as one of the SNN subnetworks. A wake word detector is used in conjunction with the speaker identification model to meet the requirements of the real-time use case. A GUI is created using Qt, an open-source, multi-platform framework, to allow users to interact with the speaker identification system in real time [5].

Dataset
The dataset used for training is the open-source VoxCeleb2 media dataset, which consists of *.m4a files. It contains more than a million utterances from more than 6,000 speakers [6]. It was chosen because its variation in noise can mimic a noisy environment [7]; thus, a speaker identification model trained on this dataset will have good noise robustness in real-time scenarios. The raw data is divided into training and testing parts in a ratio of 8:2, meaning that 80% of the dataset is used to train the model, while 20% is used to test the model's prediction accuracy. This split also helps keep the learning model from overfitting its training data. The feature extraction process then runs on the partitioned dataset.
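One possible way to produce such an 8:2 split is sketched below; the directory layout (one folder of *.m4a files per speaker ID) and the per-speaker split are assumptions for illustration, not necessarily the exact procedure used here.

```python
import random
from pathlib import Path

def split_dataset(root, train_ratio=0.8, seed=0):
    """Split VoxCeleb2-style data (root/<speaker_id>/**/*.m4a) into train/test lists.

    The 80:20 split is applied per speaker so that both partitions cover every speaker.
    """
    rng = random.Random(seed)
    train, test = [], []
    for spk_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        files = sorted(spk_dir.rglob("*.m4a"))
        rng.shuffle(files)
        cut = int(len(files) * train_ratio)
        train.extend(files[:cut])
        test.extend(files[cut:])
    return train, test
```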

Feature Extraction
Feature extraction is a dimensionality reduction technique that reduces a large set of raw data into smaller groups for processing. New features are then generated from the existing features found in the original dataset [8]. It is one of the essential phases in constructing the speaker identification model, extracting meaningful information from speech waves and reducing model complexity. This extracted information is loaded into the appropriate neural network architectures for model training.
Several feature extraction techniques can be used in a speaker identification system, each with its own pros and cons. For instance, PLP (perceptual linear prediction) and MFCC (Mel-frequency cepstral coefficients) have been shown to yield superior results because their architectures closely resemble human perception of voice. While LPC (linear predictive coding) is not based on the human auditory system, it is well suited to systems that require audio communication over long distances [9].
GFCC (gammatone frequency cepstral coefficients) is a newer technique that shares many similarities with MFCC. While MFCC uses triangular Mel-frequency filters and compresses the signal's dynamic range through a logarithmic nonlinearity, GFCC emphasizes spectral valleys with a cubic-root nonlinearity and employs a gammatone filter bank. GFCC is particularly beneficial for low-frequency-sensitive applications such as music and animal sounds. However, it can be more vulnerable to noise in specific scenarios [10,11]. In general, MFCC is more commonly used in speech-related applications due to its accessibility and good performance with traditional machine learning methods.
For speech-related applications using deep learning, Mel-spectrograms are better suited than MFCC and GFCC. Unlike MFCC, the Mel-spectrogram simplifies the process by eliminating the need for the DCT computation, as shown in Figure 2. This simplification retains the rich speech signal representation needed by CNNs while reducing the overall computational burden. Although it is possible to combine multiple feature extraction techniques to leverage their respective strengths, Mel-spectrograms alone are often sufficient to train accurate models [11].
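A minimal sketch of this preprocessing step using librosa is shown below; the sampling rate, number of Mel bands, and window/hop sizes are illustrative assumptions rather than the exact settings used in this work.

```python
import librosa
import numpy as np

def mel_spectrogram(path, sr=16000, n_mels=64, n_fft=400, hop_length=160):
    """Compute a log-scaled Mel-spectrogram from an audio file."""
    y, sr = librosa.load(path, sr=sr)              # decode and resample the waveform
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    # Log compression of the Mel energies; note there is no DCT step, unlike MFCC
    return librosa.power_to_db(mel, ref=np.max)
```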

Loss Function
A loss function evaluates how well a neural network models the training data by measuring the difference between target and predicted output values. Based on the loss function's value, the hyperparameters of the neural network are adjusted to achieve the lowest possible average loss score. This ensures that the neural network model can adapt to the task at hand. SNNs can utilize several loss functions, including binary cross-entropy, contrastive loss, triplet loss, and constellation loss.
Binary cross-entropy is a loss function suitable for speaker identification because it compares each predicted probability to the actual class output of false (0) or true (1), indicating whether the target is the correct speaker or not. The score is calculated by penalizing the predicted probabilities according to how far they are from the actual value, thereby measuring how close the predicted value is to the actual value.
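For reference, with a predicted probability p and a true label Y ∈ {0, 1}, the binary cross-entropy score takes the standard form L = −[Y·log(p) + (1 − Y)·log(1 − p)], which approaches zero as p approaches Y.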
The contrastive loss has been shown to outperform the binary cross-entropy loss function in SNNs for speaker identification tasks, as it excels at distinguishing between two objects [12]. Since the primary purpose of an SNN is to compare the similarity of different instances rather than classify a single instance, the contrastive loss is more appropriate for this scenario. It operates by bringing similar samples closer together and pushing dissimilar samples farther apart when presented with two input samples. The contrastive loss is calculated using the formula below, where Y is set to 0 if the samples are similar and 1 otherwise [13]. For similar pairs, the loss minimizes the Euclidean distance term ||x_i − x_j||²; for dissimilar pairs, it minimizes the term max(0, m − ||x_i − x_j||)², which becomes zero once the pair is separated by at least the margin m. In short, the loss score is lowered by pulling the embeddings of similar pairs together and by pushing the embeddings of dissimilar pairs at least a distance m apart.

L = (1 − Y) · ||x_i − x_j||² + Y · max(0, m − ||x_i − x_j||)²

where
L = loss score
Y = true output (0 for a similar pair, 1 for a dissimilar pair)
x_i = embedding of the observed input sample
x_j = embedding of the reference sample from the training dataset
||x_i − x_j|| = Euclidean distance between the two embeddings (the network's predicted output)
m = margin hyperparameter that specifies the minimum distance enforced between dissimilar samples.
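A minimal TensorFlow/Keras sketch of this loss is given below, assuming the network outputs the Euclidean distance between the two embeddings and using an illustrative default margin of m = 1.0.

```python
import tensorflow as tf

def contrastive_loss(margin=1.0):
    """Contrastive loss for a Siamese network whose output y_pred is the
    Euclidean distance between the two embeddings.

    y_true is 0 for similar pairs and 1 for dissimilar pairs, matching the
    convention above; the margin value is an assumed default.
    """
    def loss(y_true, y_pred):
        y_true = tf.cast(y_true, y_pred.dtype)
        similar_term = (1.0 - y_true) * tf.square(y_pred)                        # pull similar pairs together
        dissimilar_term = y_true * tf.square(tf.maximum(margin - y_pred, 0.0))   # push dissimilar pairs beyond the margin
        return tf.reduce_mean(similar_term + dissimilar_term)
    return loss
```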

Model Training
Siamese neural networks offer several advantages for speaker identification, notably high accuracy, invariance to environmental changes, and invariance to changes in the speaker's voice, such as age, accent, and emotion [14]. Once trained, an SNN-based speaker identification model does not need to be retrained to identify a new speaker. The SNN performs pattern-matching analysis against the reference features of the speech database, and the best match is identified by comparing similarity scores. The SNN is trained to compare inputs by learning semantic similarity rather than learning features directly from a large-scale dataset for classification, as a CNN does. The SNN is therefore robust with respect to the dataset: because it focuses on learning semantic similarity, only a few samples per class are required for it to learn embeddings that place the same classes together. Several high-performance neural network architectures, such as VGG16 and ResNet50, have been used as subnetworks in speaker identification models [2,15,16]. However, these networks have high resource consumption and are unsuitable for deployment on embedded systems. Table 1 lists several lightweight CNNs compared to VGG and ResNet.
MobileNetv2 serves as the baseline subnetwork in our work. After feature extraction, the SNN architecture is first implemented using a customized MobileNetv2 subnetwork with no classification layer. Input spectrograms are fed into the subnetworks, and a custom distance function connects both subnetworks and measures the difference between their outputs. This is followed by a flatten layer and a dense layer with a sigmoid activation function, outputting values between 0 and 1. The model is trained using binary cross-entropy as the loss function and Adam as the optimizer with a learning rate of 3e-4. A batch size of 32 is used, and the model is trained for 40 epochs with 1000 steps per epoch. Training is performed on an NVIDIA GeForce GTX 1080 Ti with 11 GB of GPU memory.
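The setup can be summarized with the Keras sketch below; the input spectrogram shape, the absolute-difference distance function, and the generator name are illustrative assumptions rather than the exact configuration used in this work.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

INPUT_SHAPE = (128, 128, 3)  # assumed spectrogram input shape

# Shared MobileNetv2 subnetwork with the classification head removed
base = tf.keras.applications.MobileNetV2(input_shape=INPUT_SHAPE,
                                         include_top=False, weights=None)

in_a = layers.Input(shape=INPUT_SHAPE)
in_b = layers.Input(shape=INPUT_SHAPE)

# Custom distance function connecting the two branches (absolute difference as an example)
dist = layers.Lambda(lambda t: tf.abs(t[0] - t[1]))([base(in_a), base(in_b)])
flat = layers.Flatten()(dist)
out = layers.Dense(1, activation="sigmoid")(flat)   # similarity score between 0 and 1

siamese = Model(inputs=[in_a, in_b], outputs=out)
siamese.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(learning_rate=3e-4))
# siamese.fit(pair_generator, steps_per_epoch=1000, epochs=40)  # pairs batched with size 32
```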
In the next phase of the work, contrastive loss is used as the loss function. Contrastive loss is more suitable for SNNs because it pulls clusters of points belonging to the same class closer together in the embedding space while pushing away clusters of points belonging to different classes [21]. The model is therefore retrained using contrastive loss to exploit the pair labels more effectively.
SqueezeNet is then used as a subnetwork of the SNN due to its lower parameter count [19]. The batch size is increased to 64 to take advantage of the available GPU memory, while all other parameters remain unchanged. The model is trained for 40 epochs with 1000 steps per epoch using contrastive loss.
To further enhance the architecture, the subnetwork of the SNN is replaced with the neural network structure of MCUNet, which is designed to fit within a 256 kB memory constraint. MCUNet optimizes accuracy, memory usage, and energy efficiency by combining the efficient neural architecture search of TinyNAS with the lightweight inference engine of TinyEngine [20]. In this work, the two-stage neural architecture search of TinyNAS is used without running inference through TinyEngine; TinyNAS produces a specialized network architecture optimized to fit the resource constraints. To train the speaker identification model with the TinyNAS-derived MCUNet architecture, the batch size remains 64, the other parameters remain unchanged, and the model is trained for 40 epochs with 1000 steps per epoch using contrastive loss.

Training Results
In addition to accuracy, a model's performance can be evaluated with a loss score, which measures the difference between predicted and actual values. On the MobileNetv2 subnetwork, both binary cross-entropy and contrastive loss functions were used, with contrastive loss retained for the remaining experiments. Due to GPU memory limits, the batch size for this SNN was set to 32, and training converged around the 40th epoch. The same held for the other subnetworks, so training was stopped at the 40th epoch to prevent overfitting.
Figures 4 to 7 show the training results for an SNN with MobileNetv2 as the subnetwork, comparing binary cross-entropy loss and contrastive loss as the loss functions. Figures 8 and 9 show the training results using SqueezeNet as the subnetwork, while Figures 10 and 11 show the training results using MCUNet. The training results are summarized in Table 2.
In conclusion, the speaker identification model using MCUNet-256kB as the SNN subnetwork achieved the highest accuracy on the Raspberry Pi 4, while the SqueezeNet model demonstrated the fastest inference time. Regardless of the SNN subnetwork type, all of the speaker identification models can be deployed on the Raspberry Pi 4. MCUNet-256kB has fewer parameters than SqueezeNet but takes longer for inference due to its deeper network and larger number of operations. Using MobileNetv2 as a baseline, it is possible to build a speaker identification model with either higher accuracy or faster inference time by trading off these two factors, although the differences in each factor are not large. This work highlights that MCUNet, a neural architecture search-based model, can be successfully executed on microprocessors with high accuracy and low resource consumption.

Embedded Deployment
The Raspberry Pi 4 Model B was selected as the platform for deploying the speaker identification model, considering its available resources [22,23]. The Raspbian 64-bit operating system was installed first, followed by the libraries needed to run the Python scripts. The models were trained on the GPU, and the model parameters were saved in HDF5 file format before being loaded onto the Raspberry Pi for deployment. A hands-free GUI was created with the PyQt5 library to provide test inputs to the speaker identification model on the Raspberry Pi 4 [5].
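A brief sketch of this hand-off between the training machine and the Raspberry Pi is shown below; the file name is hypothetical.

```python
import tensorflow as tf

# On the training machine: siamese.save("speaker_id_snn.h5") writes the model in HDF5 format.
# On the Raspberry Pi, the same file is loaded back for inference only, so compilation
# (and hence the custom loss function) is not required.
model = tf.keras.models.load_model("speaker_id_snn.h5", compile=False)

# score = model.predict([query_spectrogram, reference_spectrogram])
```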
To enhance accuracy and reduce power consumption, a two-stage speaker identification system is created by integrating a simple wake word detector with the speaker identification model. The first stage is the always-on wake word detector, while the second stage performs the actual speaker identification, triggered only after the correct wake word is detected [24]. The wake word detector is built using the Picovoice end-to-end edge AI platform, with "picovoice" serving as the wake word [25]. Once the wake word is detected, the speaker identification stage is activated, and the result is displayed after inference completes.
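The two-stage loop can be sketched as follows using the Picovoice Porcupine wake word engine; the access key is a placeholder, identify_speaker() stands in for the SNN inference described above, and the exact SDK calls may differ between Picovoice versions.

```python
import pvporcupine
from pvrecorder import PvRecorder

# Stage 1: always-on wake word detector ("picovoice" as the wake word)
porcupine = pvporcupine.create(access_key="YOUR_ACCESS_KEY",  # placeholder key
                               keywords=["picovoice"])
recorder = PvRecorder(frame_length=porcupine.frame_length)
recorder.start()

try:
    while True:
        pcm = recorder.read()                 # one frame of 16-bit PCM samples
        if porcupine.process(pcm) >= 0:       # wake word detected
            # Stage 2: run the speaker identification model and show the result
            speaker = identify_speaker()      # hypothetical call into the SNN model
            print(f"Identified speaker: {speaker}")
finally:
    recorder.delete()
    porcupine.delete()
```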

Conclusion
In conclusion, this paper presents a study on the implementation of a lightweight Siamese neural network (SNN) for speaker identification on resource-constrained platforms. The results demonstrate that SNNs, specifically those using the MCUNet subnetwork, can achieve high accuracy and fast inference times on embedded systems such as the Raspberry Pi. Furthermore, the contrastive loss function is found to outperform binary cross-entropy loss in the SNN for speaker identification tasks. This research establishes that an appropriate lightweight SNN, combined with contrastive loss, can significantly improve speaker identification accuracy and enable efficient deployment on resource-constrained platforms. Future work could explore the performance of SNNs with other lightweight subnetworks and investigate additional loss functions for further improvements in speaker identification systems on embedded platforms.

Figure 1. Architecture of the Siamese Neural Network.

Table 1. Selected CNN architectures for speaker identification.

Table 2. Summary of training results.