Performance Analysis Similarity Matrix, Responsibility Matrix, Availability Matrix, Criterion Matrix of Affinity Propagation

Clustering is the process of grouping data into clusters so that data within one cluster have maximum similarity and data in different clusters have minimum similarity. Clustering methods are divided into two approaches: partitioning and hierarchical [1]. This study analyzes the Affinity Propagation (AP) method on the Water Quality Status dataset, which has 8 attributes, 4 classes and 120 instances. The classes are evenly distributed: good condition (30 instances), lightly polluted (30 instances), medium polluted (30 instances) and heavily polluted (30 instances). 70% of the data is used as training data and 30% as randomized test data. To simplify the performance calculation of the clustering model, the research was implemented using MATLAB functions. Runs with 100 to 2,500 iterations produced 10 clusters. With 5,000 iterations, the number of clusters changed to 9. Re-testing with 10,000 to 50,000 iterations produced no further change in the number of clusters. The conclusion from testing the AP method is that the most optimal number of clusters is 10.


1. Introduction
Clustering is the process of grouping data into clusters so that data within one cluster have maximum similarity and data in different clusters have minimum similarity. Clustering methods are divided into two approaches: partitioning and hierarchical [1]. The authors of [2] determined the Preference value from the normalized total similarity of each data point and used this value for all Preference values. Compared with the standard Affinity Propagation algorithm, this improved the clustering results, with higher accuracy and a higher Silhouette Coefficient. According to [3], clustering groups objects (data) based only on the information contained in the objects and the relationships between them. Data is usually grouped according to the similarity of values between data [4]. In the AP algorithm, each data point is viewed as a node in a network; all data points repeatedly exchange messages until a good set of exemplars is formed [5].

2. Research Methods
This study examines the performance of the AP algorithm in relation to the method used to determine the Preference value, since the Preference value affects the performance of the AP algorithm. Performance is measured by comparing the Silhouette Coefficient (Rousseeuw, 1986) produced by the standard Affinity Propagation algorithm, which sets the Preference value to the median of all values in the Similarity Matrix, with that produced by the modified Affinity Propagation algorithm.
The Silhouette Coefficient is used to validate a single data point, a single cluster, or an entire clustering. It is widely used for cluster validation because it combines cohesion and separation values. The Silhouette Coefficient ranges from -1 to +1. A value close to 1 indicates that the data point is in the right cluster; a value of 0 or close to 0 indicates that the data point lies on the border between two clusters. For the dataset $X = \{x_1, x_2, \ldots, x_n\}$, $x_i \in \mathbb{R}^n$, the Silhouette Coefficient (Rousseeuw, 1986) of a data point is

$$ s(x_i) = \frac{b(x_i) - a(x_i)}{\max\{a(x_i),\, b(x_i)\}} $$

where $a(x_i)$ is the average distance from $x_i$ to the other data points in the same cluster (intra-cluster distance) and $b(x_i)$ is the smallest average distance from $x_i$ to the data points of any other cluster (inter-cluster distance).
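The definition above can be sketched in a few lines of Python with NumPy. This is an illustrative sketch, not the implementation used in the study: the function name `silhouette_values`, the toy data and the convention that a point in a singleton cluster gets a value of 0 are assumptions, and the sketch expects at least two clusters.

```python
import numpy as np

def silhouette_values(X, labels):
    """Per-point silhouette s(i) = (b(i) - a(i)) / max(a(i), b(i)).

    Hypothetical helper; assumes Euclidean distance and at least two clusters.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points.
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself from its own cluster
        if not same.any():
            continue  # singleton cluster: s(i) is left at 0 (assumed convention)
        a = dist[i, same].mean()  # intra-cluster distance a(i)
        # Smallest average distance to the points of any other cluster, b(i).
        b = min(dist[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

# Toy data (made up for illustration): two well-separated pairs of points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
labels = np.array([0, 0, 1, 1])
s = silhouette_values(X, labels)
print(s.mean())  # average silhouette of the whole clustering
```

Averaging the per-point values over a cluster or over the whole dataset gives the cluster-level or overall Silhouette Coefficient.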

3. Results and Discussion
The AP algorithm was implemented with the help of a supporting program. The purpose of applying the AP method is to analyze how the data can be grouped more optimally during clustering, which requires an analysis of the AP method itself.
The Z-Score normalization of the TSS attribute (mg/L) for the 1st data point is calculated as follows:

$$ z = \frac{x - \mu}{\sigma} $$

where z is the standard score, x is the observed value, μ is the mean of the variable and σ is the standard deviation of the variable. The result of the Z-Score normalization is data with mean = 0 and standard deviation = 1.

The Water Quality Status dataset has 8 attributes, 4 classes and 120 instances. The classes are evenly distributed: good condition (30 instances), lightly polluted (30 instances), medium polluted (30 instances) and heavily polluted (30 instances). 70% of the data is used as training data and 30% as randomized test data.
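As a small illustration of the Z-Score normalization, the following Python sketch standardizes each column of a data matrix. The function name `z_score` and the sample values are hypothetical and are not taken from the Water Quality Status dataset. The source does not state whether the sample or the population standard deviation was used, so the sketch exposes this as the `ddof` parameter and assumes the population form by default.

```python
import numpy as np

def z_score(X, ddof=0):
    """Column-wise z = (x - mu) / sigma.

    ddof=0 uses the population standard deviation (an assumption);
    ddof=1 would use the sample standard deviation instead.
    Constant columns (sigma = 0) are not handled in this sketch.
    """
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)               # mean per variable
    sigma = X.std(axis=0, ddof=ddof)  # standard deviation per variable
    return (X - mu) / sigma

# Made-up values for two attributes (not real measurements).
data = np.array([[10.0, 7.1],
                 [20.0, 6.8],
                 [30.0, 7.4],
                 [40.0, 6.5]])
Z = z_score(data)
print(Z.mean(axis=0))  # approximately 0 for each column
print(Z.std(axis=0))   # approximately 1 for each column
```

Normalizing each attribute this way keeps attributes with large numeric ranges from dominating the Euclidean distances used later for the similarity values.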
To simplify the performance calculation of the clustering model, the research was implemented using MATLAB functions. The Similarity Matrix is computed first. The similarity between two data points is the negative of the sum of squared differences between them, that is, the negative squared Euclidean distance. The specific value used for the similarity of a data point with itself is called the preference. The preference value taken for the data is 25.64, so the diagonal of the Similarity Matrix is filled with the value 25.64.

The next step is to calculate the responsibility values for all data, using 50,000 iterations; the results form the Responsibility Matrix. The diagonal of the Responsibility Matrix holds the self-responsibility values, that is, the entries whose row index i and column index k coincide. To obtain the responsibility r(i,k), the similarity s(i,k) is reduced by the largest value of a(i,k') + s(i,k') over the other columns k' ≠ k of row i, where a denotes the availability; in the first iteration all availabilities are zero, so this is simply the largest other similarity in row i.

The availability values are then calculated and form the Availability Matrix. The diagonal availability a(k,k) of the water quality status data is obtained from column k of the Responsibility Matrix by adding up only the positive responsibility values (values > 0), excluding the row whose index equals the column index k.

The next step is to add the availability values to the responsibility values to form the Criterion Matrix, which was obtained from the water quality status data after 50,000 iterations. Each criterion value is the sum of the corresponding values in the Availability Matrix and the Responsibility Matrix.
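The four matrices described above can be sketched together in Python with NumPy. This is a minimal sketch under stated assumptions, not the MATLAB code used in the study: the function name `affinity_propagation`, the damping factor, the fixed iteration count and the toy data are all assumptions, and the preference defaults to the median of the off-diagonal similarities as in the standard algorithm described in Section 2.

```python
import numpy as np

def affinity_propagation(X, preference=None, damping=0.5, max_iter=200):
    """Return similarity S, responsibility R, availability A, criterion C.

    Hypothetical sketch; damping and max_iter are assumed values.
    """
    X = np.asarray(X, dtype=float)
    n = len(X)
    rows = np.arange(n)
    # Similarity: negative squared Euclidean distance.
    S = -((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    if preference is None:
        # Standard choice: median of the off-diagonal similarities.
        preference = np.median(S[~np.eye(n, dtype=bool)])
    S[rows, rows] = preference  # preference fills the diagonal
    R = np.zeros((n, n))
    A = np.zeros((n, n))
    for _ in range(max_iter):
        # r(i,k) = s(i,k) - max over k' != k of [a(i,k') + s(i,k')]
        AS = A + S
        best = AS.argmax(axis=1)
        first = AS[rows, best]
        AS[rows, best] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[rows, best] = S[rows, best] - second
        R = damping * R + (1 - damping) * R_new
        # a(i,k) = min(0, r(k,k) + sum of positive r(i',k), i' not in {i,k})
        # a(k,k) = sum of positive r(i',k), i' != k
        Rp = np.maximum(R, 0)
        Rp[rows, rows] = R[rows, rows]  # keep r(k,k) unclipped
        A_new = Rp.sum(axis=0)[None, :] - Rp
        diag = A_new[rows, rows].copy()
        A_new = np.minimum(A_new, 0)
        A_new[rows, rows] = diag
        A = damping * A + (1 - damping) * A_new
    C = A + R  # criterion matrix
    return S, R, A, C

# Toy data (made up): two small groups of points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
S, R, A, C = affinity_propagation(X)
print(C.argmax(axis=1))  # exemplar index chosen by each row
```

Damping blends each new message with the previous one to reduce oscillation between iterations; the value 0.5 here is only an illustrative choice.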
The highest value in each row (i-th row) of the criterion matrix identifies the exemplar (center point) of that data point. An exemplar is a data point selected from the existing data to represent other data. The table below shows the exemplar values obtained from applying the AP method.
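Separately from the reported results, the row-wise selection rule can be illustrated with a short Python sketch. The 4 x 4 criterion matrix below is invented for illustration and the function name `exemplars_from_criterion` is hypothetical.

```python
import numpy as np

def exemplars_from_criterion(C):
    """For each row i, pick the column k with the highest criterion value."""
    C = np.asarray(C, dtype=float)
    assignment = C.argmax(axis=1)      # exemplar index chosen by each data point
    exemplars = np.unique(assignment)  # distinct exemplars = cluster centers
    return assignment, exemplars

# Made-up criterion matrix (not from the Water Quality Status data).
C = np.array([
    [ 2.0, -1.0, -5.0, -6.0],
    [ 1.5, -0.5, -4.0, -5.0],
    [-6.0, -5.0, -1.0,  3.0],
    [-7.0, -6.0, -2.0,  1.0],
])
assignment, exemplars = exemplars_from_criterion(C)
print(assignment)      # [0 0 3 3]
print(exemplars)       # [0 3]
print(len(exemplars))  # 2 clusters
```

The number of distinct exemplars is the number of clusters, which is the quantity compared across iteration counts in this study.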

4. Conclusion
In conclusion, the results show that these two parameters greatly influence the final clustering of the AP algorithm. The Water Quality Status dataset has 8 attributes, 4 classes and 120 instances, evenly distributed across the classes: good condition (30 instances), lightly polluted (30 instances), medium polluted (30 instances) and heavily polluted (30 instances). The preference value taken for the Water Quality Status data is 25.64. The diagonal availability value is obtained from column k of the responsibility matrix by adding up only the positive responsibility values (values > 0), excluding the row whose index equals the column index k. The criterion value is obtained by adding the value in the availability matrix to the value in the responsibility matrix. The highest value in each row (i-th row) of the criterion matrix identifies the exemplar (center point). The resulting exemplars define 10 clusters; each exemplar is the data point with the highest criterion value, selected from the data to represent other data. Runs with 100 to 2,500 iterations produced 10 clusters. With 5,000 iterations, the number of clusters changed to 9. Re-testing with 10,000 to 50,000 iterations produced no further change in the number of clusters. The conclusion from testing the AP method is that the most optimal number of clusters is 10.