Modified ISR hyper-heuristic for tuning automatic genetic clustering chromosome size

Recent works using hyper-heuristics to solve clustering problems have focused on the Genetic Algorithm (GA). However, to the best of our knowledge, no work uses a hyper-heuristic dedicated to tuning the GA's chromosome size for the automatic clustering problem. The ability to tune the chromosome size is important because it allows the automatic clustering algorithm to be adaptive and dynamic. This paper proposes and evaluates a modified Improvement Selection Rules (ISR) hyper-heuristic algorithm for tuning the chromosome size of automatic genetic clustering. The paper reviews related work on tuning the GA's parameters and on selective hyper-heuristic algorithms, and proposes a modified algorithm. The Iris, Breast Cancer, Wine and E. coli datasets are used to evaluate the algorithm in terms of fitness, accuracy and robustness. The results indicate that the hyper-heuristic algorithm produces good performance (fitness) and accuracy but consumes considerably more execution time.


Introduction
Hyper-heuristics are defined as the automatic design of dispatching rules using evolutionary computation and machine learning techniques [1]. Hyper-heuristics provide an alternative methodology to meta-heuristics which permits adaptive selection and/or generation of meta-heuristics automatically during the search process [2][3][4][5]. Hyper-heuristics permit the integration of two or more meta-heuristic search operators from different meta-heuristics through one defined parent heuristic via non-domain feedback, i.e. a meta-heuristic chooses among meta-heuristics [2][3][4][5]. The selection of a particular search operator to be used in any particular instance can be decided adaptively and dynamically based on feedback from its previous performance [2]. With meta-heuristic integration/hybridization, each algorithm can exploit the strengths and cover the weaknesses of the collaborating algorithms [2][3][4][5][6][7][8].
At present, numerous hyper-heuristics have been developed to investigate Wireless Sensor Networks (WSN), multi-objective optimization, combinatorial optimization, expensive optimization, agent optimization, scheduling, fog computing, open vehicle routing, prediction, classification, pairwise testing, hybridization of nature-inspired algorithms, the travelling thief problem (TTP), cyber security and the timetabling problem [1,5].
1st International Conference on Science, Engineering and Technology (ICSET) 2020, IOP Conf. Series: Materials Science and Engineering 932 (2020) 012065, IOP Publishing, doi: 10.1088/1757-899X/932/1/012065

Recent works using hyper-heuristics for solving clustering problems have focused on the GA (Genetic Algorithm) and Particle Swarm Optimization (PSO) [32][33][34][35][36][37]. GA parameters, such as the population size, its operations and the maximum number of generations, are tuned using hyper-heuristics by combining the advantages of two or more meta-heuristics [32][33][34][35]. However, to the best of our knowledge, no work uses a hyper-heuristic dedicated to tuning the GA's chromosome size for the automatic clustering problem. Without a hyper-heuristic algorithm, obtaining a dynamic and optimal chromosome size for the Genetic Clustering Algorithm for an unknown number of clusters (GCUK) from a single meta-heuristic would be difficult, as suggested by the No Free Lunch theorem [2].
Automatic clustering methods using nature-inspired meta-heuristics such as the GA are progressing but face several challenges [38]. For example, the automatic clustering methods need to be enhanced for multi-attribute optimization [38]. In addition, there are limitations such as sub-optimal clustering accuracy and overlapping clusters [39].
The goal of this paper is to propose and evaluate a modified hyper-heuristic algorithm for tuning the chromosome size of GA-based automatic clustering. The organisation of the paper is as follows. Section 2 reviews related work on approaches to tuning meta-heuristic parameters and operators related to the GA. Section 3 presents a modified Improvement Selection Rules (ISR) hyper-heuristic for tuning the automatic genetic clustering chromosome size. Section 4 covers the experimental results and their discussion. Section 5 gives the conclusions and future work.

Related work
Hyper-heuristics have been proposed for the GA and designed to tune the parameters and operators of GAs [32][33][34][35][36][37]. The related works are summarized in table 1.

Table 1. Related works on tuning GA parameters and operators.

[32] Discussed self-adaptation in evolutionary algorithms, where the parameters in the chromosome co-evolve with the solutions.

[33] A fuzzy hyper-heuristic framework for GA parameter tuning, covering the population size, the probability of breeding, the mutation probability and the maximum number of generations.

[34] A multi-objective hyper-heuristic genetic algorithm (MHypGA) that incorporates twelve low-level heuristics based on different methods of selection, crossover and mutation.

[35] Discussed self-adaptation in the meta-evolutionary algorithm (EA), where an EA is used to tune GA parameters such as the crossover rate, mutation and population size.
Based on table 1, hyper-heuristics have been used to tune the GA's population size, selection, crossover and mutation processes, and maximum number of generations. However, to the best of our knowledge, no work is dedicated to tuning the GA's chromosome size for the automatic clustering problem. It has been acknowledged that the GA needs to be fine-tuned to specific problems and problem instances using parameter control [33,35].
There are two types of hyper-heuristics, i.e. generative and selective [2]. A generative hyper-heuristic combines low-level heuristics (LLH), while a selective hyper-heuristic selects from a set of LLH. The hyper-heuristics in table 2 have been utilized with the GA in the literature [2]. The Choice Function (CF) and Exponential Monte Carlo with counter (EMCQ) were among the earliest hyper-heuristics. The CF used a reinforcement learning framework, while the EMCQ adopted a simulated-annealing-like probability density function [2].
Meanwhile, the Improvement Selection Rules (ISR) and the Fuzzy Inference Selection (FIS) are more recent hyper-heuristics. The ISR utilizes Tabu search and three measures (quality, diversification and intensification), while the FIS uses fuzzy rules to accommodate partial truth, allowing a smoother transition between the search operators [2,40]. The FIS and ISR perform better than the CF and EMCQ but are significantly more time-consuming [2]. Moreover, the FIS performance was statistically at par with the ISR [2]. However, one of the biggest challenges for the FIS is finding the right fuzzy membership estimation, where the design choices are often problem dependent and cannot be easily generalized [2].
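The EMCQ acceptance mentioned above follows a simulated-annealing-style rule. The sketch below is a minimal Metropolis-style illustration of that idea, not the exact EMCQ formulation (the function name, the use of the iteration count as a temperature, and the omission of the EMCQ counters are all simplifying assumptions):

```python
import math
import random

def emcq_accept(delta, iteration, rng=random.random):
    """Accept an improving move always; accept a worsening move with a
    probability that decays exponentially with the magnitude of the
    worsening (Metropolis-style sketch, not the exact EMCQ formula)."""
    if delta >= 0:          # improving or equal move: always accept
        return True
    # worsening move: accept with probability exp(delta / t),
    # where the iteration count plays the role of a temperature
    return rng() < math.exp(delta / max(iteration, 1))
```

In practice the EMCQ also maintains counters that reheat the acceptance probability after a run of rejections; this sketch keeps only the exponential acceptance core.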
The ISR technique exploits three rules via its improvement, diversification and intensification operators [40]. The improvement operator checks for improvements in the objective function (chromosome fitness). The diversification operator determines how diverse the current and previously generated solutions are against the population of potential candidate solutions (robustness). Finally, the intensification operator measures how close the current and previously generated solutions are to the population of solutions (accuracy). The improvement, diversification and intensification operators can be mapped to the GA chromosome fitness, robustness and accuracy, respectively, as the selection and acceptance rules. The mapping is shown in the subsequent section.

Proposed Modified ISR Hyper-heuristic
The hyper-heuristic algorithm is proposed for a GA-based automatic clustering technique known as the Genetic Clustering Algorithm for an unknown number of clusters (GCUK) [38]. The optimal chromosome size for the automatic clustering is not known in advance. Therefore, the hyper-heuristic is used to adaptively and automatically select the instance of the automatic clustering algorithm with a different chromosome size during the clustering process, as shown in figure 1. Figure 1. Hyper-heuristic designed to find the optimal size of automatic genetic clustering [2].
According to [33], there are three important factors for hyper-heuristic algorithm design, namely problem, quality and time factors. The three important factors for hyper-heuristic algorithm have been proposed as follows:

Problem Factor
The problem is to identify the optimized size (or length) of the automatic genetic clustering chromosomes. The hyper-heuristic algorithm is designed to adaptively select the instances of the automatic clustering with different chromosome sizes. As illustrated in figure 1, the proposed hyper-heuristic algorithm uses several automatic clustering instances with different chromosome sizes, l, as follows:
1. The hyper-heuristic algorithm initiates an automatic clustering instance with chromosome size = a.
2. The automatic clustering runs population initialization, fitness computation and genetic operations.
3. The hyper-heuristic selection and acceptance operator determines whether the current instance performed better than the current threshold.
4. If yes, the automatic clustering continues to run until the termination condition is met. If no, the hyper-heuristic algorithm initiates an automatic clustering instance with chromosome size = b, c or d.
5. Repeat steps 2, 3 and 4.
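The steps above can be sketched as a simple selection loop. This is an illustrative skeleton only: `run_instance` stands in for one GA run of the automatic clustering at a given chromosome size, and all names, the single-fitness threshold, and the random switching are assumptions, not the paper's exact implementation:

```python
import random

def hyper_heuristic_clustering(sizes, run_instance, max_iters=50):
    """Illustrative hyper-heuristic selection loop over automatic
    clustering instances with different chromosome sizes.

    sizes        -- candidate chromosome sizes (a, b, c, d in the text)
    run_instance -- callable: chromosome size -> fitness of one GA run
    Returns the best-performing size and its fitness."""
    size = sizes[0]                       # step 1: start with size = a
    best_size, best_fitness = size, float("-inf")
    for _ in range(max_iters):
        fitness = run_instance(size)      # step 2: init, fitness, GA operations
        if fitness > best_fitness:        # step 3: selection/acceptance check
            best_size, best_fitness = size, fitness  # step 4a: keep this instance
        else:
            # step 4b: switch to another chromosome size (b, c or d)
            size = random.choice([s for s in sizes if s != size])
    return best_size, best_fitness
```

In the paper the acceptance check uses the three ISR measures rather than fitness alone; the loop structure is the same.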
The hyper-heuristic selects the automatic clustering instance to be used at any iteration adaptively and dynamically through one defined parent heuristic via non-domain feedback. The design was inspired by the work in [33], which used a hyper-heuristic to tune GA parameters. The hyper-heuristic algorithm uses the ISR, which dynamically and self-adaptively improves the solutions at runtime [2]. As a result, the strengths of each automatic clustering instance running on a different chromosome size can be utilized and the weaknesses of the collaborating algorithms can be balanced. The proposed hyper-heuristic measures the solution quality while optimizing the chromosome sizes.

Quality Factor
A suitable hyper-heuristic strategy, i.e. the Improvement Selection Rules (ISR), has been selected. The ISR is based on Tabu search and measures the quality, diversification and intensification operators [2], as shown in table 3. The quality operator shown in table 3 determines improvement in the optimization objective function. The diversification operator checks how diverse the current and previously generated solutions are against the population of potential candidate solutions. Meanwhile, the intensification operator evaluates how close the current and previously generated solutions are to the population of solutions. These three ISR operators have been mapped to the automatic clustering quality validation parameters, namely fitness, robustness and accuracy, as shown in table 3. The accuracy is the difference between the suggested number of clusters and the real number of clusters [41]. Meanwhile, the robustness is measured by the difference between the most optimal and the second most optimal number of clusters. The mapping was made according to the similarity of the ISR operators to the automatic clustering quality validation parameters.
The modified ISR operators are as follows:
i. Fitness: this is the quality improvement operator that compares the current best fitness solution (f_current) with the previous best fitness solution (f_previous) [2]. For example, if f_current is 5 and f_previous is 3, the fitness improvement is 2, indicating that the current solution improves on the last solution. If the improvement value is less than zero, the hyper-heuristic algorithm will initiate another instance of automatic genetic clustering.

Improvement = f_current − f_previous (1)

ii. Robustness (exploration): this operator calculates the difference between the number of clusters of the current best solution (k_best) and of the second-best solution (k_second) [2]. For example, if k_best is 5 and k_second is 2, the robustness is 3. This shows the gap between the cluster counts of the current best and second-best solutions; the larger the robustness, the larger the exploration. The operator then compares this robustness with that of the previous iteration's best solution (R_previous) as follows:

ΔR = (k_best − k_second) − R_previous (2)

If the value is less than zero, the hyper-heuristic algorithm will initiate another enhanced GCUK instance.

iii. Accuracy (exploitation): the accuracy is equal to 1 divided by the difference between the number of clusters of the best solution (k_best) and the average number of clusters of the previous best solutions (k_avg). It is formulated as follows:

Accuracy = 1 / |k_best − k_avg| (3)

For example, if k_best is 5 and k_avg is 3, the accuracy is 0.5. An accuracy of 1 is the highest accuracy, indicating that the number of clusters of each best solution is accurate. This operator then calculates the difference between the current best solution's accuracy (A_current) and the previous best solution's accuracy (A_previous) [2]. The exploitation improves given the following:

ΔA = A_current − A_previous (4)

For example, if A_current is 1 and A_previous is 0.5, there is an improvement in the exploitation. Meanwhile, if the value is less than zero, the hyper-heuristic algorithm will initiate another enhanced instance of automatic genetic clustering.
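The three operators and their worked examples above can be written as small functions. The symbol names are illustrative, and the handling of a zero cluster-count difference in the accuracy operator is our assumption (the 1/|difference| form is undefined there):

```python
def fitness_improvement(f_current, f_previous):
    """Quality operator: a positive value means the current best
    solution improved on the previous best."""
    return f_current - f_previous

def robustness(k_best, k_second):
    """Exploration operator: gap between the cluster counts of the best
    and second-best solutions; larger values mean more exploration."""
    return k_best - k_second

def accuracy(k_best, k_avg):
    """Exploitation operator: 1 / |difference| between the best
    solution's cluster count and the average cluster count of previous
    best solutions. Assumption: a zero difference is treated as the
    maximum accuracy of 1, since the formula is undefined there."""
    diff = abs(k_best - k_avg)
    return 1.0 if diff == 0 else 1.0 / diff
```

The worked examples in the text follow directly: a fitness improvement of 5 − 3 = 2, a robustness of 5 − 2 = 3, and an accuracy of 1/|5 − 3| = 0.5.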
According to [2], the selection of the search operators needs to take into account the balance between exploration and exploitation. The proposed Tabu-search ISR-based hyper-heuristic algorithm is described by the pseudocode shown in figure 2 (right) and compared with the existing algorithm (left) [40]. In the pseudocode shown in figure 2, the Tabu list keeps the GCUK algorithms that fail to improve the current thresholds (F1, F2, F3). The Tabu max is the size of the list, which can be deduced from the number of heuristics [40]. The proposed Tabu max in the modified hyper-heuristic algorithm (right) is 11 because 11 instances of the automatic genetic clustering algorithm with different chromosome sizes are run in turn by the hyper-heuristic algorithm to find the optimal solutions.

Time Factor
The time factor has been measured by the real time it takes for the proposed hyper-heuristic algorithm to deliver results [33]. The time should be feasible for the intended application [33].

Results and Discussion
The Iris, WDBC, Wine and E. coli datasets were used as benchmarking datasets. These datasets were tested on the proposed modified ISR hyper-heuristic algorithm. Table 4 shows the results. The main reason the hyper-heuristic algorithm has a higher execution time is that it runs multiple automatic genetic clustering instances alternately, together with the selection and acceptance operator, as shown in figure 2.
The number-of-clusters results were then measured in terms of accuracy and robustness. Table 5 shows the comparisons. A negative accuracy value indicates that the produced number of clusters is smaller than the real number of clusters, and vice versa for a positive accuracy. A robustness higher than zero shows that the algorithm is robust. The accuracy and robustness of the hyper-heuristic algorithm are similar to those of the enhanced GCUK for the Iris, Wine and E. coli datasets, while the accuracy is better for the WDBC dataset.
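The sign conventions above can be stated as a one-line check (the helper name and return shape are illustrative, not part of the paper's evaluation code):

```python
def interpret(accuracy, robustness):
    """Interpret the table-5 measures: a negative accuracy means fewer
    clusters than the true number, a positive accuracy means more, and
    a robustness above zero means the run is considered robust."""
    if accuracy < 0:
        direction = "fewer"
    elif accuracy > 0:
        direction = "more"
    else:
        direction = "exact"
    return direction, robustness > 0
```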
The results show that the ISR-based hyper-heuristic algorithm produces good performance (fitness) but consumes considerably more execution time [2,42]. For larger datasets, the execution time of the hyper-heuristic has to be given careful consideration.

Conclusion
The automatic genetic clustering algorithm can generate a dynamic chromosome size as a result of using the modified hyper-heuristic algorithm. The fitness and accuracy of the GCUK have been improved, and the hyper-heuristic algorithm has been modified to suit the automatic clustering problem. For future work, machine learning algorithms can be used to enhance the hyper-heuristic selection operators, aiming to improve the solution fitness, accuracy and robustness and to reduce the execution time by learning from historical data.