Evaluation of location-data based features using Gaussian mixture models for age group estimation

Several studies have estimated the demographics and behavioral patterns of users of mobile devices, such as smartphones, using a variety of information obtained from such devices. However, most studies have estimated unknown demographics by correlating the geographical information of users with their mobile device usage histories and social networks. In such cases, significant costs are incurred in preprocessing the data before building an estimation model. Therefore, in this study, we verified whether user demographics can be estimated using only location data obtained from mobile devices. We constructed a machine-learning model that classifies user age groups into two classes, young and elderly, based on the input features generated from location information using a Gaussian-mixture model. By measuring the classification performance of the constructed model, we confirmed that location information contained the information necessary for user attribute estimation. Experimental results confirmed that the classification model constructed based on location information exhibited high classification accuracy for the two classes of equally sampled age groups. These findings indicate that location data contain the necessary information for estimating user demographics.


Introduction
With the spread of mobile devices, such as smartphones, demand for technologies that utilize big data acquired from such devices is growing. Big data contain a variety of information, and several studies have estimated the demographics and behaviors of users using such information. For example, Wang et al. [1] analyzed logs obtained from wireless access points (APs) and proposed a method for estimating user demographics based on the prior knowledge of the social networks and AP trajectories of users. Montasser et al. [2] proposed a method to predict user demographics by analyzing tweets using geotags. In addition, various studies have been conducted on demographic estimation methods based on information obtained from mobile devices [3][4][5][6][7][8].
However, these methods require prior knowledge of social networks and analysis of text posted on social network services to be combined with geographic information, which incurs significant data preprocessing costs. Hence, in this study, we examined the feasibility of estimating user demographics based only on location data obtained from a device, without relying on prior knowledge about the user. We used the point-type floating population data [9] to generate input features from the location information of each device and constructed a machine-learning model to estimate the age group of the user based on these features. This study aimed to verify whether the location information acquired from a mobile phone contains information that can be used to estimate the age group of the user. We simplified the age group into two classes, young and elderly, and verified the performance of the proposed method.
Each location record was assigned a daily ID and an age group. Location information was acquired every minute. However, as acquisition was triggered by events, such as the launch of a specific application by a user, the number of data points differed for each ID. Therefore, missing location data were complemented using a simple method, and the complemented location data were regarded as the movement trajectory of the user. The obtained movement trajectory consisted of 1,440 location data points; however, as it was impractical to use the raw coordinates as input features, representative points were determined in the target area, and input features were generated by aggregating the location data to the representative points at each unit time. A Gaussian-mixture model (GMM) was fitted to the location data of all IDs during the target period, and the mean of each Gaussian component was regarded as the coordinates of a representative point. The time spent by mobile-device users at the representative points obtained in this manner was quantified and regarded as a feature value. A support vector machine (SVM) [10,11], which is a typical machine-learning model, was trained on the generated features to construct a classification model.
The performance of the constructed classification model was verified using Narashino, a city in Chiba Prefecture, Japan, as the target area and late September 2021 as the target period. The classification model was trained using 70% of the valid data acquired during this period and evaluated using the remaining 30% as test data. The constructed model performed better than random selection between the two classes, with an accuracy rate and an F-score of approximately 0.60 each. However, there is room for improvement in the way the features were designed.

Features for prediction of age group
2.1. Data
2.1.1. Location data
This study used location and age group data extracted from the point-type floating population data collected from smartphones and other mobile devices and provided by Agoop Inc. [9]. The location data were assigned a daily ID for each carrier terminal, such as a smartphone, and were acquired when a specific application was launched. Location information was acquired every minute, and the daily ID was reset at 0:00 every day; therefore, the activities of an individual associated with a device could only be identified for a maximum of one day. Each daily ID had a maximum of 1,440 (24 × 60) location data points per day. However, as location information was obtained from events, such as application launches, there were missing data for almost all daily IDs. Therefore, the amount of location data differed for each daily ID.
For the age attributes, to verify whether the location data contained information for estimating demographics, we considered a simple two-class classification problem in which only the young and elderly were estimated.The young group comprised those in their 20s or younger, and the elderly group comprised those in their 60s or older.

Preprocessing for location data
The location data used to classify the demographics of daily IDs must account for the movements between two points. If the unit time is one minute, the movement of a daily ID should be represented by 1,440 location data points per day. However, as explained in Section 2.1.1, the number of daily location data points $d$ satisfies $d \le 1{,}440$ for all daily IDs; therefore, the missing data need to be supplemented.
Consider the case in which the data of an ID are acquired at location A at time $t_1$ and at location B at time $t_2$, where $t_1 < t_2$. If $t_2 - t_1 > 1$, the ID has missing location data between $t_1$ and $t_2$. In such cases, the data must be completed, which requires an assumption about how the user moved. The following assumptions can be considered.
(a) A user remains at location A for $t$ minutes from $t_1$ and then starts moving and arrives at location B at $t_2$.
(b) A user leaves location A at $t_1$ and remains at location B for $t$ minutes from arrival until $t_2$.
(c) A user leaves location A at $t_1$ and arrives at location B at $t_2$.
If assumption (a) or (b) is adopted, the time $t$ spent at location A or B must be estimated. If assumption (c) is adopted, there is no need to estimate the time spent at points A and B, and the straight line segment connecting them can be regarded as the complementary path. Assumptions (a) and (b) require the length of stay to be estimated for every gap in the data, and the resulting data may vary considerably depending on the method used to estimate the length of stay. Therefore, assumption (c) was adopted in this study, and the travel path was complemented by dividing the segment between points A and B into $(t_2 - t_1)/t_{\mathrm{unit}}$ equal parts, where $t_{\mathrm{unit}}$ is the time unit.
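The completion under assumption (c) can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function name `complete_trajectory`, the dictionary input format, and the rule of holding the nearest observation before the first and after the last record are all assumptions of this sketch, since the text does not specify how the edges of the day are handled.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (latitude, longitude)

MINUTES_PER_DAY = 24 * 60  # 1,440 one-minute slots


def complete_trajectory(observed: Dict[int, Point]) -> List[Point]:
    """Fill a one-day trajectory by linear interpolation (assumption (c)).

    `observed` maps minute-of-day (0..1439) to an observed coordinate.
    Slots before the first and after the last observation hold the nearest
    observation; this edge rule is an assumption of the sketch.
    """
    if not observed:
        raise ValueError("at least one observation is required")
    times = sorted(observed)
    # Start with the first observation everywhere (covers the leading slots).
    path: List[Point] = [observed[times[0]]] * MINUTES_PER_DAY
    for t1, t2 in zip(times, times[1:]):
        (lat1, lon1), (lat2, lon2) = observed[t1], observed[t2]
        steps = t2 - t1  # number of unit-time intervals between A and B
        for s in range(steps):
            w = s / steps  # fraction of the way from A to B
            path[t1 + s] = (lat1 + w * (lat2 - lat1), lon1 + w * (lon2 - lon1))
    # Hold the last observation until the end of the day.
    for t in range(times[-1], MINUTES_PER_DAY):
        path[t] = observed[times[-1]]
    return path
```

With a unit time of one minute, a gap of $t_2 - t_1$ minutes is split into $t_2 - t_1$ equal steps, so every returned trajectory has exactly 1,440 points.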

Summarization of target region using GMM
The location data used in this study were expressed as two-dimensional coordinates of latitude $x^{\mathrm{lat}}$ and longitude $x^{\mathrm{lon}}$ at a point in continuous space. As the coordinates were in continuous space, $x^{\mathrm{lat}}, x^{\mathrm{lon}} \in \mathbb{R}$, it was impractical to treat the raw coordinates as input features. Therefore, several representative points were defined in the target space, and the location data scattered in the continuous space were discretized by aggregating them into representative points to generate the input features.
It would be natural to choose representative points with high population densities, such as major stations and commercial facilities in the target area; however, this approach requires the analysis of regional characteristics every time the target area changes, and a general-purpose framework cannot be constructed. Therefore, the method proposed in this study used GMM to summarize a continuous space into several representative points. GMM is a clustering method that represents a dataset scattered in space by the superposition of $n$ Gaussian distributions.
As the location data lie in a two-dimensional space, the Gaussian-mixture distribution is generated by the superposition of bivariate Gaussian distributions. A bivariate Gaussian distribution is given by

$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{2\pi |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right),$$

where $\mu$ and $\Sigma$ are the mean vector and variance-covariance matrix, respectively, and $x = (x_1, x_2)^\top$. The Gaussian-mixture distribution is defined as

$$p(x) = \sum_{i \in I} \pi_i \, \mathcal{N}(x \mid \mu_i, \Sigma_i),$$

where $I$ is the set of bivariate Gaussian distributions, and $\mu_i$ and $\Sigma_i$ are the mean vector and variance-covariance matrix of the $i$-th Gaussian distribution, respectively. In addition, $\pi_i$ is the mixing coefficient representing the weight of the $i$-th Gaussian distribution, and $|I| = n$. The mean $\mu_i$, $i \in I$, of each of the $n$ bivariate Gaussian distributions obtained in this manner is regarded as the coordinates of a representative point, and these points approximately represent the target space. The number $n$ of superimposed Gaussian distributions is a hyperparameter; in this study, it was set to the value that minimizes the Bayesian information criterion (BIC) within a given range of $n$.
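The BIC-based choice of $n$ can be made concrete with the standard definition of the criterion. The expression below is a sketch under two assumptions that the text does not state: each component has an unconstrained (full) symmetric covariance matrix, and $N$ denotes the number of location samples used to fit the mixture. $\hat{L}_n$ is the maximized likelihood of the $n$-component model, and $k_n$ counts its free parameters.

```latex
\mathrm{BIC}(n) = k_n \ln N - 2 \ln \hat{L}_n,
\qquad
k_n = \underbrace{(n - 1)}_{\text{mixing coefficients}}
    + \underbrace{2n}_{\text{means}}
    + \underbrace{3n}_{\text{covariances}}
    = 6n - 1,
\qquad
n^{*} = \operatorname*{arg\,min}_{n} \mathrm{BIC}(n).
```

Under these assumptions, a model with $n = 45$ components would have $6 \times 45 - 1 = 269$ free parameters. The penalty term $k_n \ln N$ grows with $n$, so adding components lowers the BIC only when the gain in log-likelihood outweighs the added parameters.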

Nearest neighbor representative points for location data
The time-series location data processed as described in Section 2.1.2 can be regarded as the movement trajectory associated with the daily ID. If the unit time is one minute, the movement trajectory of a daily ID contains 1,440 location data points because 24 × 60 = 1,440 minutes. As these points exist in continuous space, using them directly as input features was impractical. Therefore, the input features were generated by aggregating these 1,440 data points into the representative points obtained by GMM. In particular, we determined the nearest neighbor representative point for each of the 1,440 coordinates and assigned the location data point to that representative point. The number of times the location data were assigned to each representative point, that is, the stay time, was stored. This operation generated feature vectors $g_k \in \mathbb{R}^n$, $\forall k \in K$, from the 1,440 data points scattered in the continuous space of a daily ID $k$, where $K$ is the set of daily IDs.
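Let $x_{k,t} = (x^{\mathrm{lat}}_{k,t}, x^{\mathrm{lon}}_{k,t})^\top$ be the coordinates obtained at a certain time $t$ for the daily ID $k$, and let $z_i$, $i \in I$, be the cluster variable, with $z_i = 1$ if sample $x_{k,t}$ belongs to the Gaussian distribution $i \in I$ and $z_i = 0$ otherwise. Then, the probability that $z_i = 1$ given the sample is $p(z_i = 1 \mid x_{k,t})$, and from Bayes' theorem,

$$p(z_i = 1 \mid x_{k,t}) = \frac{p(z_i = 1)\, p(x_{k,t} \mid z_i = 1)}{\sum_{j \in I} p(z_j = 1)\, p(x_{k,t} \mid z_j = 1)}$$

is obtained. As $p(x_{k,t} \mid z_i = 1)$ is the density of sample $x_{k,t}$ under the $i$-th Gaussian distribution, that is, the bivariate Gaussian density $\mathcal{N}(x_{k,t} \mid \mu_i, \Sigma_i)$ with mean vector $\mu_i$ and variance-covariance matrix $\Sigma_i$, and $p(z_i = 1)$ is the mixing coefficient $\pi_i$, we obtain

$$p(z_i = 1 \mid x_{k,t}) = \frac{\pi_i \, \mathcal{N}(x_{k,t} \mid \mu_i, \Sigma_i)}{\sum_{j \in I} \pi_j \, \mathcal{N}(x_{k,t} \mid \mu_j, \Sigma_j)}. \tag{1}$$

Furthermore, as Eq. 1 is the conditional probability that $z_i = 1$ for sample $x_{k,t}$, the index that maximizes Eq. 1,

$$i^{*} = \operatorname*{arg\,max}_{i \in I} \; p(z_i = 1 \mid x_{k,t}),$$

can be regarded as the cluster to which sample $x_{k,t}$ belongs. Thus, the 1,440 location data points for each daily ID $k$ were transformed into an $n$-dimensional feature vector $g_k$ to generate a training dataset. Note that $\sum_{i \in I} g_i = 1{,}440$, where $g_i$ is the $i$-th element of $g_k$.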
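The assignment rule and the resulting stay-time feature vector can be sketched as follows. This is a minimal pure-Python illustration, not the authors' code: the names `gaussian_pdf`, `assign`, and `stay_time_features` and the tuple layout of a mixture component are assumptions of this sketch. Because the denominator of Eq. 1 is the same for every $i$, the sketch compares only the numerators $\pi_i \mathcal{N}(x \mid \mu_i, \Sigma_i)$.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Cov = Tuple[Tuple[float, float], Tuple[float, float]]
# One mixture component: (weight pi_i, mean mu_i, covariance Sigma_i).
Component = Tuple[float, Point, Cov]


def gaussian_pdf(x: Point, mean: Point, cov: Cov) -> float:
    """Density of a bivariate Gaussian with a symmetric 2x2 covariance."""
    (a, b), (_, c) = cov
    det = a * c - b * b
    d1, d2 = x[0] - mean[0], x[1] - mean[1]
    # (x - mu)^T Sigma^{-1} (x - mu), using the closed-form 2x2 inverse.
    quad = (c * d1 * d1 - 2.0 * b * d1 * d2 + a * d2 * d2) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def assign(x: Point, components: Sequence[Component]) -> int:
    """Index i maximizing pi_i * N(x | mu_i, Sigma_i), i.e. maximizing Eq. 1."""
    scores = [w * gaussian_pdf(x, m, cov) for w, m, cov in components]
    return max(range(len(scores)), key=scores.__getitem__)


def stay_time_features(trajectory: Sequence[Point],
                       components: Sequence[Component]) -> List[int]:
    """Count how many trajectory points fall on each representative point."""
    g = [0] * len(components)
    for x in trajectory:
        g[assign(x, components)] += 1
    return g
```

For a completed one-day trajectory of 1,440 points, the returned counts sum to 1,440. With real coordinates the densities of distant components can underflow to zero, so a practical version would likely compare log-densities instead.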

Experimental method 3.1. Construction of dataset
The target region for this experiment was Narashino, Chiba, Japan (whose area and population were 20.97 km² and 76,197, respectively, as of 2020) [12,13]. The point-type floating population covered in this study was limited to data obtained from Narashino City, and data obtained from outside the target area were excluded. However, since missing location data were linearly complemented, as shown in Section 2.1.2, there were cases in which complementary data existed outside the target region because the region was nonconvex.

IC-MSQUARE-2023 Journal of Physics: Conference Series 2701 (2024) 012070
The data covered by this experiment spanned from September 20 to 30, 2021. In Japan, educational institutions, such as universities, are usually on summer vacation until mid-September, and this period was excluded from the data because people's behavior changes significantly during vacations. In addition, as people's behavior during holidays changes significantly, only weekday location data were used in this experiment.
The numbers of location data samples used in the experiment were 47,003 and 18,756 for the young (20s or younger) and elderly (60s or older) groups, respectively. As the number of samples for the young age group was significantly larger than that for the elderly age group, the evaluation of the model performance would be complicated by the class imbalance. Thus, the young age group was undersampled to 18,756. Based on 18,756 × 2 = 37,512 location data points, a mixed Gaussian distribution was generated using GMM. The number of Gaussian distributions $n$ was varied in the range 1-50, and the distributions were computed using the expectation-maximization algorithm [14]. The BIC was calculated for each of the obtained GMMs; the lowest BIC was obtained with $n = 45$, and the means of these 45 Gaussian distributions were used as the coordinates of the representative points. The movement trajectory of each daily ID was aggregated to the 45 representative points obtained in this manner, and the feature vectors $g_k$, $k \in K$, were generated. As the feature value of a representative point is the time spent there, the minimum value is 0 and the maximum value is 24 × 60 = 1,440; however, in the experiment, the values were normalized to a minimum of 0 and a maximum of 1. The feature vectors $g_k$, $k \in K$, and the corresponding age group labels $\mathrm{age}_k$ were used as the input and output, respectively, to construct a classification model. The classification model was evaluated using 70% of the dataset as training data and the remaining 30% as test data. The numbers of data points in the training and test datasets were 26,258 and 11,254, respectively. The test data were extracted from the original dataset by random sampling; note that the test data were sampled so that the number of samples for each label was in the same ratio as in the original dataset.
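The normalization and the label-preserving 70/30 split can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: the function names, the fixed seed, the per-class rounding rule, and the reading of the normalization as division by the theoretical maximum of 1,440 minutes are all choices made for this sketch.

```python
import random
from typing import Dict, List, Sequence, Tuple

MINUTES_PER_DAY = 24 * 60


def normalize(g: Sequence[int]) -> List[float]:
    """Scale stay times from [0, 1440] to [0, 1] (assumed normalization)."""
    return [v / MINUTES_PER_DAY for v in g]


def stratified_split(labels: Sequence[str], test_ratio: float = 0.3,
                     seed: int = 0) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices) with the label ratio preserved."""
    rng = random.Random(seed)
    by_label: Dict[str, List[int]] = {}
    for idx, label in enumerate(labels):
        by_label.setdefault(label, []).append(idx)
    train: List[int] = []
    test: List[int] = []
    for indices in by_label.values():
        rng.shuffle(indices)
        # Per-class rounding is an assumption of this sketch.
        n_test = round(len(indices) * test_ratio)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)
```

Applied to two classes of 18,756 samples each, this rounding rule yields 5,627 test samples per class, that is, 11,254 test and 26,258 training samples, which agrees with the sizes reported above.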

Classification model
Using the dataset prepared as described in Section 3.1, a machine-learning model was constructed to classify the age group for an unknown feature vector $g = (g_1, g_2, \dots, g_{45})^\top$. This experiment used SVM as the machine-learning model, and the radial basis function was selected as the kernel function of SVM. An SVM with this kernel requires tuning two parameters: the parameter $C$, which controls the tradeoff between the error term and the regularization term of the objective function, and the parameter $\gamma$ of the radial basis function. In this study, we used grid search and cross-validation to search for these two parameters in the range $C, \gamma \in \{2^{-5}, 2^{-4}, \dots, 2^{10}\}$. The search resulted in $C = 1$ and $\gamma = 32$.
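The search space and the kernel can be written out explicitly. The sketch below only enumerates the candidate pairs and evaluates the radial basis function in its common form $\exp(-\gamma \lVert u - v \rVert^2)$; the names `GRID_VALUES`, `parameter_grid`, and `rbf_kernel` are hypothetical, and the cross-validation loop and SVM solver themselves are not shown.

```python
import math
from itertools import product
from typing import List, Sequence, Tuple

# Candidate values 2^-5, 2^-4, ..., 2^10, used for both C and gamma.
GRID_VALUES: List[float] = [2.0 ** e for e in range(-5, 11)]


def parameter_grid() -> List[Tuple[float, float]]:
    """All (C, gamma) pairs visited by an exhaustive grid search."""
    return list(product(GRID_VALUES, GRID_VALUES))


def rbf_kernel(u: Sequence[float], v: Sequence[float], gamma: float) -> float:
    """Radial basis function kernel exp(-gamma * ||u - v||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)
```

The grid contains 16 values per parameter and therefore 256 pairs, and the selected values $C = 1 = 2^{0}$ and $\gamma = 32 = 2^{5}$ both lie inside it rather than on its boundary.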

Metrics
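Using the trained classification model, the performance was evaluated using the test data. Evaluation metrics were accuracy, precision, recall, and F-measure, defined as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP},$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \text{F-measure} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$

where TN, FN, FP, and TP are the numbers of true negatives, false negatives, false positives, and true positives, respectively. Given that the test data had a label distribution of 5:5 for the young and elderly age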

Experimental results
Fig. 1 shows the representative points in Narashino generated based on GMM. The small dots represent the location data of all daily IDs during the target period, and the large dots are the representative points. As shown in Fig. 1, the representative points are located in densely populated areas in the target region. The representative points and location information of each daily ID were used to generate features, and the dataset was generated using the method described in Section 3.1 to train a classification model using SVM. The accuracy rate was 0.70 for the training data and 0.60 for the test data. Although the accuracy rate for the test data was lower than that for the training data, it was confirmed to be higher than the chance-level criterion of 0.50. Fig. 2 presents a heat map of the confusion matrix that visualizes the classification results of the test data using the constructed model. Fig. 2 shows that the classification performance for the young and elderly age groups is comparable and that correctly classified samples outnumber misclassified samples in both classes. Furthermore, Table 1 shows the accuracy, precision, recall, and F-score for the test data. The classification performance was approximately 0.60 for all the metrics. Note that the metric values in Table 1 are reported for each age group.
Based on these results, the classification model constructed in this study shows that location data contain information necessary to classify age groups. On the other hand, there is considerable room for improvement in classification performance. Travel paths, including time-series information between representative points, contain significant information about the behavioral patterns of a person. The input features generated using the proposed method, which focuses only on the time spent at the representative points, do not include time-series information and travel paths between representative points. Therefore, time-series relationships and connections between representative points should be considered to construct a higher-performance model.

Conclusion
This study examined whether location data acquired from mobile devices contain information necessary to classify user demographics, focusing on user age groups. In particular, we constructed a machine-learning model from user location information and age groups and evaluated its classification performance. The input features for classification were designed from representative points, generated using GMM, that summarize the target region and from the time spent by the users at these representative points. The classification model was constructed using SVM. In the experiment, we generated a dataset with an equal number of samples from the young and elderly age groups and constructed a classification model using 70% of the dataset as training data. The remaining 30% were used as test data to evaluate the classification model performance using four indices: accuracy, precision, recall, and F-score.
The results showed that all the indices were approximately 0.60, suggesting that the location data included information that can be used to estimate at least two classes of age groups: young and elderly.
In the future, we will design an input feature set that considers time-series changes and travel paths of mobile-device users. Furthermore, by improving the classification performance, we will reinforce the importance of location data in estimating demographic information.

Figure 1. Representative points generated using the GMM and raw user location information in Narashino, Chiba.

Figure 2. Heat map generated based on the confusion matrix obtained using the result of test data.