Geomagnetic field fingerprint from wristband and smartphone clusterization for indoor localization

Many wearables and smartphones now carry built-in sensors that can be used to obtain location information for an Indoor Localization System (ILS). Such systems could support the development of Smart City systems, where an ILS collects information about space and placement inside a room or building. In this research, we used datasets from the UCI Machine Learning Repository that contain timestamped geomagnetic field fingerprints captured by both smartphone and smartwatch sensors at certain locations. We then clustered the data to determine whether each device could be used effectively for ILS based on the radiation recorded in the same environment. We used K-means (for k=2 to k=10) and hierarchical clustering. After several experiments, hierarchical clustering worked best: the outlier problem did not occur with this algorithm, every instance was clustered with high homogeneity among cluster members, and the clusters differed sufficiently in their characteristics. Newer devices with more sensors are needed for future research.


Introduction
The smart city is a widely discussed topic today. As smart city concepts have spread around the world, newer and more advanced technology has become necessary to realize them, and mobile devices are clear evidence of this. The development of mobile devices has accelerated significantly in the last decade, with many technology enhancements and innovations implemented in today's mobile device designs. This improved technology has changed how people use mobile devices. For example, smartphones today are not limited to making calls and sending messages. People can use their smartphones to play games, browse the internet, access social media, and so on. One use of smartphones that is easily found in everyday life is navigation.
Almost every smartphone produced in the last decade has a built-in Global Positioning System (GPS) receiver for navigation purposes. GPS is useful when we need guidance to a specific location, but it has weaknesses. Its accuracy in indoor or underground environments can decrease because the satellite signal is blocked, which affects the ability to build an Indoor Localization System (ILS). An ILS is a system that can be used in Ambient Assisted Living (AAL) scenarios [1], robotic applications [2], and indoor navigation in large environments such as malls, campuses, and airports [3]. Because of this weakness of GPS, the maker community started to build alternatives that can solve the indoor localization problem. As one of the available alternatives, some modern ILSs are based on a variety of sensors and devices embedded in smartphones [4].
This study builds on related research into indoor localization methods based on geomagnetic field fingerprinting [5]. The radiation records are clustered to find out whether this method is effective enough to solve the indoor localization problem. This research uses K-means clustering and hierarchical clustering to find the best number of clusters with different algorithms. This method builds a model based on training data [6]. Furthermore, the model is used to predict test data [7]. With this research, it is hoped that more detail can be found on which algorithm is better suited to the problem and how well the clustered data can be used to analyze the method.

Method
The process of partitioning or grouping a given set of patterns into disjoint clusters is known as clustering [8]. In this research, we used two kinds of clustering algorithms: K-means clustering and hierarchical clustering.

K-means clustering
Large data sets can be clustered using the K-means clustering algorithm. K data elements are selected as initial centers [9]. The distance from every data element to each center is calculated with the Euclidean distance formula, and each data element is then moved to the cluster whose centroid is closest. K-means is a type of unsupervised learning: no class or label information is available for the data, and the clusters must be discovered without it [10].

Algorithm.
Algorithm of K-means clustering [11]:
1) Select the number 'c' of cluster centers.
2) Find the distance between each data point $x_i$ and each cluster center $v_j$ using the Euclidean distance formula:
$$d(x_i, v_j) = \sqrt{\sum_{m} (x_{im} - v_{jm})^2}$$
where i = 1 to n.
3) Assign each data point to the cluster center whose distance from it is the minimum over all cluster centers.
4) Recalculate the new cluster centers using:
$$v_i = \frac{1}{c_i} \sum_{j=1}^{c_i} x_j$$
where $c_i$ represents the number of data points in the i-th cluster and the $x_j$ are the data points assigned to that cluster.
5) Recalculate the distance between each data point and the new cluster centers.
6) If no data point was reassigned, stop; otherwise repeat from step 3.
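The steps above can be sketched in plain Python as follows. This is only an illustrative sketch, not the software used in this paper: the function names `euclidean`, `mean_point`, and `kmeans`, the random choice of initial centers, and the `init_centers`, `seed`, and `max_iter` parameters are all assumptions introduced here.

```python
import math
import random


def euclidean(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, init_centers=None, seed=0, max_iter=100):
    """Return (centers, labels) for a list of points (hypothetical helper)."""
    rng = random.Random(seed)
    # Step 1: choose k initial centers (random data points unless given)
    if init_centers is not None:
        centers = list(init_centers)
    else:
        centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Steps 2-3 (and 5 on later passes): nearest center for each point
        new_labels = [
            min(range(k), key=lambda j: euclidean(p, centers[j]))
            for p in points
        ]
        # Step 6: stop when no point changed its cluster
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: move each center to the mean of its members
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = mean_point(members)
    return centers, labels
```

Calling `kmeans(points, k)` for each k from 2 to 10 would give the candidate clusterings compared later; the result depends on the initial centers, so in practice several seeds are usually tried.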

Hierarchical clustering
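One method in hierarchical clustering is the Ward method. The Ward method iteratively merges two clusters at a time, making sure the merger increases the total within-cluster variance by the minimum possible amount [12]. The cost of merging two clusters $S_i$ and $S_j$ is
$$\mathrm{Ward}(S_i, S_j) = \frac{N_i N_j}{N_i + N_j} \lVert c(S_i) - c(S_j) \rVert^2 \quad (1)$$
where $N_i$ and $N_j$ are the numbers of data points in $S_i$ and $S_j$, and $c(S_i)$ and $c(S_j)$ are their centroids. The algorithm proceeds as follows:
1) Start with every data point as its own cluster, so that each point is the centroid of its cluster and K equals the number of data points.
2) Find the two clusters $S_i$ and $S_j$ that are closest according to (1) and merge them, creating the new cluster $S_i \cup S_j$. Remove references to the old clusters $S_i$ and $S_j$, as well as their centroids $c(S_i)$ and $c(S_j)$.
3) Set the centroid of the new cluster to the cluster's center of gravity.
4) Reduce K by 1. If K is still bigger than the desired number of clusters, go back to Step 2.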
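The Ward merging loop can be sketched in plain Python as below. This is a naive illustration rather than the implementation used in this paper: the names `ward_cost` and `ward_clustering` are hypothetical, the merge cost is the standard Ward criterion (cluster sizes times squared centroid distance, divided by the combined size), and the cubic running time makes it suitable only for small examples, not for datasets with thousands of instances.

```python
def ward_cost(size_a, cen_a, size_b, cen_b):
    """Increase in within-cluster variance caused by merging two clusters."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(cen_a, cen_b))
    return size_a * size_b / (size_a + size_b) * sq_dist


def ward_clustering(points, n_clusters):
    """Merge clusters until n_clusters remain; return lists of point indices."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    # Every point starts as its own cluster: (member indices, centroid)
    clusters = [([i], tuple(p)) for i, p in enumerate(points)]
    while len(clusters) > n_clusters:
        # Find the pair of clusters that is cheapest to merge
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                (ia, ca), (ib, cb) = clusters[a], clusters[b]
                cost = ward_cost(len(ia), ca, len(ib), cb)
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        (ia, ca), (ib, cb) = clusters[a], clusters[b]
        members = ia + ib
        # Centroid of the merged cluster: size-weighted mean (center of gravity)
        centroid = tuple(
            (len(ia) * x + len(ib) * y) / len(members) for x, y in zip(ca, cb)
        )
        # Remove the old clusters (b first, since b > a) and add the new one;
        # the number of clusters K drops by one
        del clusters[b]
        del clusters[a]
        clusters.append((members, centroid))
    return [sorted(members) for members, _ in clusters]
```

The function returns index lists rather than labels so that the members of each final cluster can be inspected directly.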

Methodology
To find which algorithm gives the best clustering result in terms of likelihood and errors, this paper uses the publicly available Geomagnetic Field Fingerprint dataset from the UCI Machine Learning Repository, which is clustered using K-means clustering and hierarchical clustering.

Dataset description
The authors of the dataset used a smartphone and a smartwatch to capture geomagnetic field fingerprints from an environment with two users. The data acquisition process involved two campaigns performed on the first floor of the Institute of Information Science and Technologies (ISTI), inside the Italian National Council (CNR) building. The data acquisition was performed while wearing two devices simultaneously: a smartphone and a smartwatch. The smartphone model is the Sony Xperia M2, and the smartwatch is the LG W110G Watch R. Both devices were running the Android OS with dedicated apps developed to collect the data [13]. The data gathered during the study comprises both physical parameters and Wi-Fi access point information, but this paper does not use the Wi-Fi access point data. For details about the dataset acquisition, please refer to the related paper, because this paper uses only the dataset without detailing the origin of the data.
There are four datasets, each recording different data. The first dataset records the first campaign using the phone, and the second dataset records the first campaign using the watch. The first campaign was performed by one person using the phone and the watch at the same time. The third dataset records the second campaign using the phone, and the fourth dataset records the second campaign using the watch. The second campaign was likewise performed by one person using the phone and the watch at the same time. The first campaign took place first and was followed by the second campaign, so each instance in its respective dataset has a unique timestamp.

Preprocessing mechanism
Every record in the dataset is numerical. Each dataset has 13 attributes, as detailed in the previous section. The datasets share similar parameters (attributes), but the smartphone did not record gyro data. Table 1 shows the parameters collected for each device, and table 2 shows the characteristics of the dataset.
Because the phone datasets of the first and second campaigns contain different numbers of instances, instance selection is performed so that both campaigns have the same number of instances. Because limited hardware resources restrict the dataset size that can be processed, every dataset must contain the same number of instances. As a result, the number of instances in the first phone campaign is reduced by 568, and the number of instances in both watch campaigns is reduced to 17786. Details can be found in table 3.
The next step concerns the multivariate character of the dataset. Each attribute category (e.g. the accelerometer, with its X, Y, and Z components) has its own measurement unit. The dataset therefore cannot be passed to the next step (clusterization) without first being standardized. Assuming the dataset has a Gaussian distribution (bell curve), the data is transformed into z-scores using the equation below:
$$z = \frac{X - \bar{X}}{SD_X}$$
where $X$ represents the original score, $\bar{X}$ the mean of the sample distribution, $n$ the number of samples over which $\bar{X}$ and $SD_X$ are computed, and $SD_X$ the standard deviation. The value of each instance in the dataset is then replaced with its z-score. This step is performed on every dataset used in this paper. Table 3 shows an example record from the original dataset, and table 4 shows an example record from the dataset after standardization.
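The standardization step can be illustrated with a short Python sketch. It assumes each dataset is a list of numeric rows with one column per attribute; the function name `standardize_columns` is hypothetical, and the use of the population standard deviation (dividing by n) is an assumption, since the text does not say whether n or n-1 is used.

```python
import statistics


def standardize_columns(rows):
    """Replace every value with the z-score of its column."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # pstdev divides by n (assumption); statistics.stdev would divide by n - 1
    sds = [statistics.pstdev(col) for col in columns]
    return [
        # A constant column has zero spread, so its z-scores are set to 0
        tuple((x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, sds))
        for row in rows
    ]
```

After this transformation every attribute has mean 0 and standard deviation 1, so attributes measured in different units contribute on a comparable scale to the Euclidean distances used during clustering.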

Clusterization mechanism
After the dataset has been standardized, the next process is clustering. The algorithms used are K-means clustering and hierarchical clustering, both performed with Python-based software on a machine with this specification:
K-means clustering is performed on all datasets, followed by hierarchical clustering on the same datasets. The analysis is based on the clustering results and excludes performance aspects such as time complexity and processing time.

Results
As mentioned before, this research uses K-means clustering and hierarchical clustering.

K-means clustering phase
K-means clustering is performed from k=2 to k=10, with the goal of finding the best K value and the clusters built with each K value. To determine the best K value, the elbow point method is used: for each K value, the average distance between each data point and its closest centroid is computed, and the values are then compared across K.
The elbow point in figure 1(a) is located at 4, in figure 1(b) at 5, in figure 2(a) at 3, and in figure 2(b) at 3. These become the K values for the respective datasets. After the best K value has been determined, the next step is to find the centroid values of each cluster.
Table 7. Centroid values from clustered first phone campaign data with K-means clustering
With the K-means clustering phase completed, the hierarchical clustering phase follows.
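The elbow criterion used in the K-means phase can be sketched as follows. The first function computes the average distance from each point to its closest centroid, as described in the text. The second picks the elbow automatically as the point farthest below the straight line joining the two ends of the curve; this selection rule and the names `average_nearest_distance` and `elbow_k` are assumptions made for illustration, since the paper does not state how the elbow was located (it may well have been read off the figures by eye).

```python
import math


def average_nearest_distance(points, centers):
    """Mean distance from each point to its closest centroid."""
    return sum(min(math.dist(p, c) for c in centers) for p in points) / len(points)


def elbow_k(ks, avg_distances):
    """Return the k farthest below the chord joining the curve's end points."""
    x0, y0 = ks[0], avg_distances[0]
    x1, y1 = ks[-1], avg_distances[-1]
    if x1 == x0 or y0 <= y1:
        raise ValueError("expected an overall decreasing curve over several k")
    best_k, best_gap = ks[0], 0.0
    for k, d in zip(ks, avg_distances):
        # Normalise both axes to [0, 1]; the chord then runs from (0, 1)
        # to (1, 0), i.e. along the line x + y = 1
        nx = (k - x0) / (x1 - x0)
        ny = (d - y1) / (y0 - y1)
        gap = 1.0 - nx - ny  # proportional to the distance below the chord
        if gap > best_gap:
            best_k, best_gap = k, gap
    return best_k
```

Given the average distances obtained for k=2 to k=10, `elbow_k(list(range(2, 11)), distances)` returns a single candidate K; it is still worth checking that value against the plotted curve.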

Hierarchical clustering phase
Hierarchical clustering differs from K-means clustering in how the number of clusters is determined. K-means bases its calculation on a number of clusters fixed before the process starts. Hierarchical clustering does the opposite: it does not need the number of clusters to be declared before the calculation starts. It builds the cluster hierarchy, either bottom-up or top-down, and the best number of clusters is then found from that hierarchy.
From figures 3(a) and 3(b), it can be concluded that both phone campaigns have the same number of clusters, namely three, and that both watch campaigns also share the same number of clusters, namely two. With all phases completed, the next step is cluster analysis. Table 11 shows the summary of the experiment results.
From the results, we can conclude that the radiation records from the phone and the watch fall into several clusters. The clusters can serve as the basis for classification, so that a record could be classified, for example, as weak radiation, medium radiation, strong radiation, and so on. However, the clustered data alone cannot be used to assign labels, as it needs to be processed further. If the process is continued to a classification stage, the clustered data can be used to determine and predict the labels. This conclusion gives a stronger basis for continuing the research.