Improved GMMKNN Hybrid Recommendation Algorithm

This paper proposes a hybrid recommendation algorithm that combines the advantages of the Gaussian Mixture Model (GMM) and K-Nearest Neighbors (KNN) algorithms. The algorithm first applies GMM to cluster the training data, grouping users with similar interests. It then uses the KNN algorithm for prediction: during the recommendation process for a target user, it searches for neighboring users with similar interests and features within the same cluster, so the GMM clustering significantly reduces the search scope for nearest neighbors. The algorithm then selects the K most similar neighbors from the target user's cluster as the candidate pool for personalized recommendations. Additionally, the algorithm improves the weight calculation method by incorporating a Gaussian kernel function for weight estimation. Experimental results on the MovieLens ml_latest_small dataset demonstrate that the proposed improved algorithm effectively enhances the accuracy of the prediction results.


Introduction
Recommender systems, an important part of information filtering and personalized services, are playing an increasingly important role in the Internet era [1][2]. With the rapid development of Internet technology and the explosion of information, users are faced with a massive amount of choices. Providing accurate and effective recommendations based on users' personalized needs and interests has become a core issue in recommender system research.
Collaborative filtering algorithms [3][4] have been successfully applied to recommender systems. By analyzing user behaviors and preferences, they identify other users with similar interests and predict the target user's preferences based on the behaviors of these similar users, enabling personalized recommendations. However, traditional collaborative filtering algorithms have certain limitations, including the cold-start problem and data sparsity.
To tackle these concerns, scholars have proposed the Weighted k-Nearest Neighbor (WKNN) algorithm [5][6], which extends the traditional k-nearest neighbor algorithm by introducing weight factors based on the similarity between users, enabling more accurate recommendations. Furthermore, the Gaussian kernel function, a commonly used kernel method [7][8], has extensive applications in data mining and can effectively process data and extract features.
This article proposes the GMMKNN (Gaussian Mixtures K-Nearest Neighbors) hybrid recommendation algorithm with the aim of improving the accuracy and personalization of recommendation systems. First, the training data is clustered with a GMM (Gaussian Mixture Model) to assign users and items to different latent interest groups and obtain classification labels, which captures the latent correlations between users and items. Then, during the KNN recommendation process for a target user, neighbors are searched only among users in the target user's cluster, the K most similar users are selected as the candidate set, and the prediction weights are computed with a Gaussian kernel function.

K-Nearest Neighbor Algorithm
Recommendation systems are a crucial part of Internet applications. In the early days, two methods were mainly used in recommendation systems: content filtering and collaborative filtering. Content filtering categorizes items into different classes based on users' historical behavior and labeling information, and then recommends items by matching user interests with item features. Collaborative filtering, on the other hand, predicts which items a user might be interested in based on the similarity between users or items. Collaborative filtering can be further divided into user-based and item-based approaches.
The K-Nearest Neighbors (KNN) algorithm is a common collaborative filtering algorithm [9][10]. It is based on the simple assumption that the K nearest neighbors of a sample have similar characteristics. The main idea of the KNN algorithm is to use a known training sample set to calculate the similarity between the target sample and the training samples, find the K training samples nearest to the target sample, and make a decision based on the categories of those K samples. The choice of the K value is usually made through cross-validation [11]. Cross-validation is a commonly used method for model evaluation and selection that can accurately assess a model's generalization ability: the dataset is repeatedly split into training and validation sets, and training and validation are performed on each split. Selecting the optimal K value through cross-validation helps avoid overfitting or underfitting while ensuring the accuracy and generalization ability of the model, leading to a more reliable choice of the number of nearest neighbors and thus improving the performance and stability of the model.
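As an illustration only, this K-selection procedure could be sketched as follows with scikit-learn; the use of KNeighborsRegressor, the feature matrix X_train, and the targets y_train are placeholders assumed for the example, not the paper's actual implementation.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

def select_k_by_cross_validation(X_train, y_train, candidate_ks=range(2, 51), folds=10):
    """Return the K with the lowest mean cross-validated MSE, plus that MSE."""
    best_k, best_mse = None, np.inf
    for k in candidate_ks:
        model = KNeighborsRegressor(n_neighbors=k, weights="distance", metric="euclidean")
        # scikit-learn reports negated MSE, so flip the sign back.
        mse = -cross_val_score(model, X_train, y_train,
                               scoring="neg_mean_squared_error", cv=folds).mean()
        if mse < best_mse:
            best_k, best_mse = k, mse
    return best_k, best_mse
```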
When calculating similarity, commonly used metrics include Euclidean distance, the Pearson correlation coefficient, cosine similarity, etc. The metric used in this article is Euclidean distance, which is formulated as follows:

$$d(x_u, x_v) = \sqrt{\sum_{i=1}^{n} (x_{ui} - x_{vi})^2}$$

where $x_u$ and $x_v$ are two training samples belonging to users $u$ and $v$ respectively, $x_{ui}$ and $x_{vi}$ are the corresponding elements at position $i$, and $n$ is the dimension of the vectors. For the task of rating prediction using movie rating data, the pseudocode for the KNN algorithm is given in Algorithm 1 (Movie Rating Prediction based on KNN).
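For concreteness, a distance-weighted KNN rating prediction in the spirit of Algorithm 1 could be sketched as follows; this is an illustrative NumPy sketch rather than the paper's exact pseudocode, and the rating-matrix layout (rows = users, columns = movies, 0 = unrated) is an assumption.

```python
import numpy as np

def knn_predict_rating(ratings, target_user, movie, k=23, eps=1e-8):
    """Predict ratings[target_user, movie] by inverse-distance-weighted KNN.

    ratings: 2D array, rows = users, columns = movies, 0 = unrated (assumption).
    """
    # Only users who actually rated this movie can serve as neighbors.
    candidates = [u for u in range(ratings.shape[0])
                  if u != target_user and ratings[u, movie] > 0]
    # Euclidean distance between the target user's rating vector and each candidate's.
    dists = sorted((np.linalg.norm(ratings[target_user] - ratings[u]), u) for u in candidates)
    neighbors = dists[:k]
    # Inverse-distance weights; eps avoids division by zero for identical vectors.
    weights = np.array([1.0 / (d + eps) for d, _ in neighbors])
    neighbor_ratings = np.array([ratings[u, movie] for _, u in neighbors])
    return float(np.dot(weights, neighbor_ratings) / weights.sum())
```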

Gaussian Mixture Model (GMM)
The Gaussian Mixture Model (GMM) is a statistical model commonly used for cluster analysis and probability density estimation [12][13]. It assumes that the dataset is composed of multiple components, each of which can be represented by a Gaussian distribution. The GMM's probability density function is defined as follows:

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$

where $x$ represents a sample, $K$ is the number of Gaussian components in the GMM, $\pi_k$ is the weight of the $k$-th Gaussian component, $\mu_k$ and $\Sigma_k$ are the mean vector and covariance matrix of the $k$-th Gaussian component, respectively, and $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ denotes the Gaussian density.
The number of clusters $K$ in the Gaussian Mixture Model is commonly determined using the elbow method [14]. First, GMMs are fitted to the given dataset for different values of $K$. Then, the log-likelihood loss of the model, denoted $\mathcal{L}$, is computed for each $K$. Finally, a curve of $K$ against the corresponding loss values is plotted, and the value of $K$ at which the loss starts to level off is chosen. The parameters of the GMM, including $\pi_k$, $\mu_k$, and $\Sigma_k$, are estimated using the Expectation-Maximization (EM) algorithm. In the EM algorithm for GMM parameter estimation, we first initialize the mean vectors $\mu_k$, covariance matrices $\Sigma_k$, and weights $\pi_k$.
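In practice, this elbow procedure can be sketched with scikit-learn's GaussianMixture, which runs the EM algorithm described next internally; the sketch below, including the data matrix X and the candidate range k_range, is illustrative and not part of the paper's original implementation.

```python
from sklearn.mixture import GaussianMixture

def elbow_curve(X, k_range=range(1, 11)):
    """Fit a GMM for each candidate K and return its mean log-likelihood per sample.

    The 'elbow' is the K after which this curve stops improving noticeably.
    """
    scores = []
    for k in k_range:
        gmm = GaussianMixture(n_components=k, max_iter=200, random_state=0).fit(X)
        scores.append(gmm.score(X))  # average log-likelihood of X under the fitted model
    return list(k_range), scores
```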
E-step: For each sample $x_i$, calculate its posterior probability $\gamma(z_{ik})$ of belonging to the $k$-th Gaussian component, using the current parameter estimates and Bayes' theorem together with the Gaussian density:

$$\gamma(z_{ik}) = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$$

where $\mathcal{N}(x \mid \mu, \Sigma)$ stands for the multivariate Gaussian distribution's probability density function.
M-step: Using the posterior probabilities calculated in the E-step, update the parameter estimates for each Gaussian component.

Update the mean vectors: $\mu_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma(z_{ik}) \, x_i$

Update the covariance matrices: $\Sigma_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma(z_{ik}) (x_i - \mu_k)(x_i - \mu_k)^\top$

Update the weights: $\pi_k = \frac{N_k}{N}$

where $N_k = \sum_{i=1}^{N} \gamma(z_{ik})$ is the effective number of samples belonging to the $k$-th component, and $N$ is the total number of samples. The E-step and M-step are repeated until convergence or until the maximum number of iterations is reached (200 by default). The pseudocode for clustering using the Gaussian Mixture Model is given in Algorithm 2.
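A minimal NumPy sketch of one EM iteration under the update rules above is shown below; it uses SciPy's multivariate_normal for the Gaussian density and is meant only to mirror the formulas, not to reproduce the paper's implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pis, mus, sigmas):
    """One EM iteration for a GMM. X: (N, d); pis: (K,); mus: (K, d); sigmas: (K, d, d)."""
    N, K = X.shape[0], len(pis)

    # E-step: responsibility gamma(z_ik) of component k for sample i (Bayes' theorem).
    resp = np.zeros((N, K))
    for k in range(K):
        resp[:, k] = pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=sigmas[k])
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: re-estimate weights, means, and covariances from the responsibilities.
    Nk = resp.sum(axis=0)                     # effective number of samples per component
    new_pis = Nk / N
    new_mus = (resp.T @ X) / Nk[:, None]
    new_sigmas = np.empty_like(sigmas)
    for k in range(K):
        diff = X - new_mus[k]
        new_sigmas[k] = (resp[:, k, None] * diff).T @ diff / Nk[k]
    return new_pis, new_mus, new_sigmas
```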

GMMKNN Hybrid Recommendation Algorithm
The paper proposes the GMMKNN (Gaussian Mixtures K-Nearest Neighbors) hybrid recommendation system algorithm, which combines the Gaussian Mixture Model (GMM) and K-Nearest Neighbors (KNN) algorithms to improve the accuracy and personalization of the recommendation system.
The main idea of the algorithm is as follows. First, the training dataset is clustered using GMM; users and items are assigned to different latent interest groups and corresponding labels are obtained. This clustering can capture the underlying correlations between users and items. Next, in the recommendation process for a target user, we first find all users with the same cluster label as the target user. The purpose of this step is to better identify users with similar interests and characteristics, thereby increasing the accuracy and reliability of the recommendation results.
Then, we use the KNN algorithm to find neighboring users within this user set. The KNN algorithm determines the neighbors by calculating the distance or similarity between the other users and the target user, and selects the top K most similar users as the candidate set. This enables personalized recommendations, because the users in the candidate set are more likely to share the target user's preferences and interests.
Finally, we use a weighted prediction formula to calculate the recommendation results. By combining the distances between neighboring users with the weights derived from their respective interest groups, we assign a predicted value to each item. This weighted approach better balances the impact of different factors on the recommendation results, enhancing the personalization and accuracy of the recommendation system. The pseudocode for rating prediction using the GmmKnniw algorithm is given in Algorithm 3.
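To make the pipeline concrete, a rough end-to-end sketch of the GmmKnniw idea (cluster the users with a GMM, then run distance-weighted KNN only inside the target user's cluster) might look as follows; it relies on scikit-learn's GaussianMixture, and the rating-matrix layout, default parameters, and function names are assumptions made for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_knn_predict(ratings, target_user, movie, n_clusters=3, k=23, eps=1e-8):
    """Cluster users with a GMM, then predict a rating via KNN restricted to the user's cluster."""
    gmm = GaussianMixture(n_components=n_clusters, max_iter=200, random_state=0)
    labels = gmm.fit_predict(ratings)                 # one cluster label per user

    # Candidate neighbors: users in the same cluster who have rated this movie.
    same_cluster = np.where(labels == labels[target_user])[0]
    candidates = [u for u in same_cluster if u != target_user and ratings[u, movie] > 0]

    # Distance-weighted KNN prediction inside the cluster.
    dists = sorted((np.linalg.norm(ratings[target_user] - ratings[u]), u) for u in candidates)
    neighbors = dists[:k]
    weights = np.array([1.0 / (d + eps) for d, _ in neighbors])
    values = np.array([ratings[u, movie] for _, u in neighbors])
    return float(np.dot(weights, values) / weights.sum())
```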

GMMKNN with Gaussian Kernel
The Gaussian kernel function [15][16] is a commonly used kernel method with wide applications. It measures the similarity between samples and is used for data processing and feature extraction, with a significant impact on data mining, pattern recognition, and machine learning. Classification algorithms such as Support Vector Machines (SVM) use the Gaussian kernel function to construct nonlinear decision boundaries, improving the performance and adaptability of the classifier [8]. By computing similarities between samples, the kernel implicitly maps them from the original input space to a higher-dimensional feature space, allowing more complex, nonlinear relationships in the data to be captured than simple linear similarity measures permit. In addition, traditional similarity measures can be distorted when feature scales differ significantly, whereas the Gaussian kernel computes similarity from the relative distances between samples, which alleviates the scale disparity issue.
Considering the advantages of measuring similarity with the Gaussian kernel function, this paper replaces the traditional inverse-distance-weighted prediction based on Euclidean distance with a Gaussian kernel similarity measure. The Gaussian kernel function is calculated as follows:

$$K(x_u, x_v) = \exp\!\left(-\frac{\lVert x_u - x_v \rVert^2}{2\sigma^2}\right) \tag{7}$$

where $x_u$ and $x_v$ represent the two samples, i.e., users $u$ and $v$, $\lVert x_u - x_v \rVert$ is the Euclidean distance between them, and $\sigma$ is the bandwidth parameter of the Gaussian kernel function.
The GMMKNN algorithm with Gaussian kernel uses the Gaussian kernel function for rating prediction. The pseudocode for this algorithm is given in Algorithm 4.
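A sketch of the kernel-weighted prediction step under formula (7) (the weighting used by GmmKnnkw) could look like the following; the default σ = 0.1 mirrors the bandwidth reported in the experiments, while the function names and data layout are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel_weight(x_u, x_v, sigma=0.1):
    """Similarity between two user rating vectors via the Gaussian kernel."""
    dist_sq = float(np.sum((x_u - x_v) ** 2))
    return float(np.exp(-dist_sq / (2.0 * sigma ** 2)))

def kernel_weighted_prediction(ratings, target_user, movie, neighbors, sigma=0.1):
    """Predict a rating as the Gaussian-kernel-weighted average of the neighbors' ratings."""
    weights = np.array([gaussian_kernel_weight(ratings[target_user], ratings[u], sigma)
                        for u in neighbors])
    values = np.array([ratings[u, movie] for u in neighbors])
    return float(np.dot(weights, values) / weights.sum())
```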

Dataset and Evaluation Metrics
In this paper, we selected the classic ml_latest_small dataset for our study. This dataset contains approximately 100,000 movie ratings, including information on around 900 movies and 600 users, and is frequently used in machine learning and recommender systems research. Before conducting the experiments, we divided the dataset into a training set and a test set in a 3:7 ratio. We trained the model using the training set and evaluated its performance on the test set.
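For reference, the split could be performed roughly as follows; the file name ratings.csv (the standard file in the MovieLens ml-latest-small archive) and the reading of the 3:7 ratio as 30% training / 70% test are both assumptions, not details stated in the paper.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# ratings.csv is assumed to be the standard MovieLens ml-latest-small ratings file.
ratings = pd.read_csv("ratings.csv")  # columns: userId, movieId, rating, timestamp

# 3:7 split as described above, read here as 30% training / 70% test (assumption).
train_df, test_df = train_test_split(ratings, train_size=0.3, random_state=42)
```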
We used Mean Squared Error (MSE) and R-Squared to evaluate the differences between predicted values and true values, as well as the goodness of fit of the model.A smaller MSE indicates more accurate predictions, while an R-Squared value closer to 1 indicates a better fit of the model.
MSE is calculated by taking the average of the squared differences between predicted values and true values. A smaller MSE value indicates smaller differences between predicted and true values, implying more accurate predictions. MSE is calculated as follows:

$$\text{MSE} = \frac{1}{n} \sum_{u,i} (r_{ui} - \hat{r}_{ui})^2 \tag{8}$$

Here, $r_{ui}$ represents the true rating, $\hat{r}_{ui}$ the predicted rating, and $n$ the size of the test set. R-Squared takes values up to 1, with a value closer to 1 indicating that a greater percentage of the variance in the target variable is explained by the model. R-Squared is calculated as follows:

$$R^2 = 1 - \frac{\sum_{u,i} (r_{ui} - \hat{r}_{ui})^2}{\sum_{u,i} (r_{ui} - \bar{r})^2}$$

Here, $r_{ui}$ represents the true rating, $\hat{r}_{ui}$ the predicted rating, $n$ the size of the test set, and $\bar{r}$ the average rating in the test set.
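Both metrics can be computed directly with scikit-learn; a minimal sketch, assuming y_true and y_pred are arrays of true and predicted ratings over the test set:

```python
from sklearn.metrics import mean_squared_error, r2_score

def evaluate(y_true, y_pred):
    """Return (MSE, R-squared) for predicted vs. true ratings over the test set."""
    return mean_squared_error(y_true, y_pred), r2_score(y_true, y_pred)
```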


Experimental Results
We compared three methods: the traditional distance-weighted KNN algorithm (Knniw), the GMMKNN hybrid recommendation algorithm based on the Gaussian kernel function (GmmKnnkw), and the distance-weighted GMMKNN hybrid recommendation algorithm (GmmKnniw). First, we analyzed the dataset and determined the optimal number of clusters using the elbow method (Figure 1). From Figure 1, the optimal number of clusters is K = 3, because from K = 3 onwards the loss curve becomes relatively flat. We also used kernel density estimation to describe the distribution of the data and to select an appropriate Gaussian kernel bandwidth. In Figure 2, we plotted the histogram of the dataset and fitted the data using different bandwidth values; a bandwidth of σ = 0.1 provided the best fit and accurately reflected the distribution characteristics.

Figure 3 compares the predictive performance of the GmmKnnkw and Knniw algorithms. In Figure 3 (left), the MSE values of GmmKnnkw are generally lower than those of Knniw; a lower MSE indicates that the predicted values are closer to the true values, so GmmKnnkw performs better in terms of prediction accuracy. In Figure 3 (right), most of the R-Squared values for GmmKnnkw are positive, while the R-Squared values for Knniw are relatively lower. Since R-Squared measures model fit, and positive values indicate a reasonably good fit, this also shows that GmmKnnkw is superior to Knniw.

Figure 4 shows the difference in predictive performance between GmmKnnkw and GmmKnniw. In Figure 4 (left), the MSE values of GmmKnnkw are generally lower than those of GmmKnniw, indicating that its predictions are closer to the true values. In Figure 4 (right), most of the R-Squared values for GmmKnnkw are positive, while those for GmmKnniw are lower. Based on this analysis, GmmKnnkw also outperforms GmmKnniw in terms of predictive performance.

Tables 1 and 2 compare the predictive performance of the three models across different numbers of neighbors. The data show that GmmKnnkw performs better than both GmmKnniw and Knniw.
First, looking at the MSE values in Table 1, we can see that the MSE of the GmmKnnkw algorithm is significantly lower than those of the GmmKnniw and Knniw algorithms. A lower MSE value indicates that GmmKnnkw's predicted values are closer to the true values, demonstrating higher prediction accuracy.
Additionally, the R-Squared values in Table 2 show that GmmKnnkw achieves higher R-Squared values than GmmKnniw and Knniw. R-Squared assesses how well a model fits the data, with greater values corresponding to a better fit. The data also show that the difference in MSE between GmmKnniw and Knniw is not significant, suggesting that directly applying GMM to the traditional inverse-distance-weighted KNN method alone does not yield a remarkable improvement. In this study, however, introducing the Gaussian kernel function effectively overcomes this issue and significantly enhances the algorithm's performance: by combining the Gaussian kernel function with KNN, GmmKnnkw exhibits superior predictive performance compared to GmmKnniw and Knniw.

In this work, the optimal number of nearest neighbors K was determined using a 10-fold cross-validation procedure. We tested K values between 2 and 50 and assessed each one's performance on the training set. By comparing the model's performance at the various K values, we determined that the optimal number of nearest neighbors is K = 23 (see Figure 5). This result was obtained through iterative validation and tuning on the training set and demonstrates good generalization performance.

Figure 6 shows the mean squared error (MSE) values of the different methods at k = 23. The bar chart makes it evident that, at this optimal k value, the GmmKnnkw algorithm has significantly lower MSE than the other methods and therefore higher prediction accuracy.

Figure 7 compares the prediction errors of the three methods (GmmKnnkw, GmmKnniw, Knniw) across all nearest-neighbor values k using a box plot. A box plot is a commonly used statistical chart that shows the distribution of data and outliers. For the Knniw method, the box plot shows a median of 0.0517, with upper and lower quartiles at 0.0536 and 0.0488 respectively. The height of the box is relatively large, indicating significant variability and dispersion of the prediction results across the entire dataset; this suggests poor prediction accuracy and a lack of consistency for certain data points.
In the GmmKnniw algorithm, the median of the data is approximately 0.0519, with upper and lower quartiles around 0.0538 and 0.049. The height of the box is relatively small, indicating a narrower range of data distribution, with most data points concentrated within the box. This suggests that the GmmKnniw method has better prediction consistency and stability than the Knniw method.
In the GmmKnnkw algorithm, the median of the data is approximately 0.0467, with upper and lower quartiles close to 0.046 and 0.0458. The height of the box is small, indicating a high level of data concentration and good consistency in the prediction results. Compared to the previous two methods, the GmmKnnkw algorithm exhibits smaller prediction errors and better stability across the entire dataset.
In conclusion, based on the box plot analysis, the GmmKnnkw algorithm performs better than the GmmKnniw and Knniw algorithms in terms of prediction accuracy and stability. This indicates that the GmmKnnkw method can provide relatively consistent and reliable prediction results.

Conclusion
The combination of the GMM and KNN algorithms can improve recommendation accuracy, enhance personalization, and increase data utilization. The hybrid recommendation system uses GMM for clustering analysis and KNN for nearest-neighbor search to identify similarity in user interests, producing more accurate recommendation results. Personalized recommendations are also strengthened by considering the similarity among users within interest groups. Furthermore, through GMM's clustering analysis, the system can effectively exploit the relevant information in the data, improving the effectiveness and data coverage of the recommendation system. A hybrid recommendation system that combines the GMM and KNN algorithms can therefore provide users with a better recommendation experience.
However, there are also areas for improvement. First, regarding the cold-start problem, the system may not be able to make accurate recommendations for new users or new items. Second, as the number of users and items increases, the hybrid recommendation system needs real-time capability and scalability to handle large-scale data and update recommendation results in a timely manner. In addition to accuracy and personalization, the diversity of the recommendation results should also be taken into account. Finally, it is important to further research and utilize user feedback to adjust and optimize the algorithms of the hybrid recommendation system.

Algorithm 1: Movie Rating Prediction based on KNN. Input: training set D, target user u, number of neighbors K. Output: rating prediction $\hat{r}_{ui}$ for target user u. (Steps: calculate the distance between the target user u and all other users; sort the distances and select the K closest neighbor users; obtain the rating $r_{vi}$ given by each neighbor v for movie i; combine the neighbors' ratings into a weighted prediction.)

Algorithm 2: Gaussian Mixture Model Clustering. Input: training set D, number of Gaussian components K. Output: clustering results C. (Steps: initialize the mean vectors $\mu_k$, covariance matrices $\Sigma_k$, and weights $\pi_k$; iterate the E-step and M-step over each Gaussian component k; assign each sample $x_i$ to the Gaussian component with the highest posterior probability and generate the final clustering results.)

Algorithm 3: GmmKnniw Algorithm. Input: user-movie dataset D, target user u, number of clusters K. Output: the user's predicted rating r for movies. (Steps: initialize the mean vectors $\mu_k$, covariance matrices $\Sigma_k$, and weights $\pi_k$; run the EM steps, updating the weight $\pi_k$ of each Gaussian component; then perform distance-weighted KNN prediction within the target user's cluster.)

Algorithm 4: GmmKnnkw Algorithm. Input: user-movie dataset D, target user u, number of clusters K. Output: the user's predicted rating r for movies. (Steps: initialize the mean vectors $\mu_k$, covariance matrices $\Sigma_k$, and weights $\pi_k$; cluster the users with the GMM; then perform Gaussian-kernel-weighted KNN prediction within the target user's cluster.)

Figure 3. Comparison of MSE and R-Squared values (GmmKnnkw vs. Knniw).

Figure 4. Comparison of MSE and R-Squared values (GmmKnnkw vs. GmmKnniw).

Figure 7. Box plot of prediction errors across all nearest-neighbor values k for the GmmKnnkw, GmmKnniw, and Knniw methods.
