Grouping the community health center patients based on the disease characteristics using C4.5 decision tree

Community health centers (Puskesmas) are among the most important public health service facilities in Indonesia. A Puskesmas serves many patients for examination or treatment every day. The accumulated medical record data is not utilized to generate new information or knowledge. One existing data mining technique is classification, the process of assigning an object with an unknown label to a class. The C4.5 algorithm is used to mine the patient diagnosis data available for 2015-2016. The results show that the C4.5 algorithm can be applied to grouping diseases. The first test, using 85 training data, has a 78% accuracy level, while the second test, using 115 training data, reaches an 88% accuracy rate.


Introduction
Community health centers (Puskesmas) in Indonesia are health centers owned by the government. Puskesmas continually improve the quality of service to patients by adopting technological progress in the health sector. The government of Indonesia has also taken action in the form of the Social Health Insurance Provider Body, called BPJS. With BPJS Health in service, the number of patients can be expected to increase. Activities at community health centers generate and collect a lot of medical record data every day. These heaps of medical record data are used for operational needs, and the daily volume is always increasing. The data can be explored as a source of historical data to find new patterns and knowledge for the community health centers, the community, and related agencies. This approach is known as data mining [1][2][3]. Data mining is a term used to describe the discovery of knowledge in a database. It is a process that uses statistical, mathematical, artificial intelligence, and machine learning techniques to extract and identify useful information and related knowledge from large databases [1,4,5].
Previous studies have applied data mining with many techniques and for many purposes, such as early prediction of heart disease [6], prediction of liver disease progression [7], characterization of gut microbiota profiles in coronary artery disease patients [8], breast cancer detection [9, 10], and mind performance in Alzheimer's disease [11]. One of these techniques, which is used in this research, is the C4.5 algorithm [12]. The C4.5 algorithm is based on the decision tree [12][13][14].
This research aims to mine valuable information from historical patient data using the C4.5 algorithm. The data will be grouped based on disease characteristics. Such a system, which can help the community health centers in determining the number of patients, is badly needed.

Dataset
We collected 150 patient records from the community health center (Puskesmas) Jetis 1, Bantul district, Indonesia. Each record consists of attributes of a patient, and the data must have at least two attribute columns: one as the input attribute column and another as the target attribute column. The values in each column are used in the calculation, and the value of each attribute must be discrete. The application reads the input with the target attribute located in the last column of the table, so the system recognizes the last column as the target attribute and the remaining columns as input attributes. Some of the variables were: 1) Age. This variable contains the age of each patient, to be filled in during the program input process. Symptom data obtained include atrophy, cough, shortness of breath, nausea and vomiting, fever, headache, and chills. Each symptom is grouped by the program into two values, yes and no. 5) Type of Illness. This variable determines the outcome of the decision. Its grouping has been fixed permanently to avoid errors in the calculation process of the program. The decision data has two values: "Infection" and "Degenerative".
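The table layout described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the age categories are the ones that appear in the decision rules reported in the results section, while the function names, the symptom column names, and the example patient are hypothetical and not taken from the original program.

```python
from typing import List, Optional, Set, Tuple

# Symptom columns named in the text; the identifier spelling is an assumption.
SYMPTOMS = ["atrophy", "cough", "shortness_of_breath", "nausea_vomiting",
            "fever", "headache", "chills"]


def age_category(age: int) -> Optional[str]:
    """Map a numeric age to one of the discrete categories used in the rules."""
    if 15 <= age <= 24:
        return "15-24"
    if 25 <= age <= 44:
        return "25-44"
    if 45 <= age <= 64:
        return "45-64"
    if age >= 65:
        return "65+"
    return None  # ages below 15 are not covered by the categories in the text


def encode_record(age: int, present: Set[str], illness: str) -> List[Optional[str]]:
    """Build one table row: discrete inputs first, target attribute last."""
    row: List[Optional[str]] = [age_category(age)]
    row += ["yes" if s in present else "no" for s in SYMPTOMS]
    row.append(illness)  # target attribute goes in the last column
    return row


def split_row(row: List[Optional[str]]) -> Tuple[List[Optional[str]], Optional[str]]:
    """Separate the input attributes from the target attribute (last column)."""
    return row[:-1], row[-1]


# Hypothetical patient, for illustration only.
example = encode_record(30, {"cough", "fever"}, "Infection")
inputs, target = split_row(example)
```

Keeping the target in the last column means the same splitting step works regardless of how many symptom columns are recorded.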

Research design
The main purpose of the design is to describe the system to be built and to clarify the flow of information and processes within the system. Figure 1 shows the stages to be performed in the system design. The calculation is done with the C4.5 algorithm to obtain the entropy and gain values, from which a decision tree and its nodes are built.

C4.5 algorithm
In the C4.5 algorithm, decision trees are formed based on decision-making criteria. The decision tree is a very powerful and well-known method of classification and prediction. The decision tree method transforms a very large set of facts into a decision tree that represents rules. The rules can be easily understood in natural language. They can also be expressed in database languages such as Structured Query Language to search for records in certain categories. In general, the C4.5 algorithm constructs a decision tree by selecting an attribute as the root, creating a branch for each value of that attribute, dividing the cases among the branches, and repeating the process for each branch until all cases on the branch have the same class [12,13]. The attribute selection is based on the highest gain value, using equation (1).
$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{i=1}^{n} \frac{|S_i|}{|S|}\,\mathrm{Entropy}(S_i) \qquad (1)$$
where S is the case set, A is the attribute, n is the number of partitions of attribute A, |S_i| is the number of cases in the i-th partition, and |S| is the number of cases in S. The entropy value is calculated before the gain value. Entropy is used to determine how informative an attribute is for generating the decision. The basic formula of entropy is given in equation (2).
$$\mathrm{Entropy}(S) = \sum_{i=1}^{n} -p_i \log_2 p_i \qquad (2)$$
where S is the case set, n is the number of partitions of S, and p_i is the proportion of S_i to S.
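The gain and entropy calculations, together with the tree-building procedure described in this section, can be sketched in Python as follows. The sketch assumes the standard definitions of entropy and information gain and follows the description given here (attribute selection by highest gain, one branch per attribute value). It is not the original program: the full C4.5 algorithm also uses gain ratio and pruning, which are omitted. The function names and the four toy rows are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy(S) = sum of -p_i * log2(p_i) over the classes in S."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def gain(rows, attr_index):
    """Gain(S, A) = Entropy(S) - sum(|S_i| / |S| * Entropy(S_i))."""
    labels = [r[-1] for r in rows]  # target attribute is the last column
    partitions = {}
    for r in rows:
        partitions.setdefault(r[attr_index], []).append(r[-1])
    remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder


def build_tree(rows, attr_indices):
    """Recursively split on the attribute with the highest gain."""
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1:
        return labels[0]  # all cases share one class: leaf node
    if not attr_indices:
        return Counter(labels).most_common(1)[0][0]  # fall back to majority class
    best = max(attr_indices, key=lambda i: gain(rows, i))
    remaining = [i for i in attr_indices if i != best]
    tree = {"attribute": best, "branches": {}}
    for value in sorted({r[best] for r in rows}):
        subset = [r for r in rows if r[best] == value]
        tree["branches"][value] = build_tree(subset, remaining)
    return tree


# Hypothetical toy rows (two input columns, target last), for illustration only.
toy_rows = [
    ("yes", "no", "Infection"),
    ("yes", "yes", "Infection"),
    ("no", "no", "Degenerative"),
    ("no", "yes", "Degenerative"),
]
toy_tree = build_tree(toy_rows, [0, 1])
```

In the toy rows the first column separates the two classes completely, so it has the highest gain and becomes the root, with each branch ending in a leaf.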

Results and discussion
To perform the data mining process, we start with data preprocessing, which includes the steps of data cleaning, data integration, data selection, and data transformation. Table 1 shows the transformation result of age. In the data mining stage, we calculate Entropy(Total) using equation (2). Table 1 shows that the highest gain value belongs to the age variable (0.437617). Thus, age is used as the root node. Of the four age categories, all have a decision except the 25-44 category, which has not yet produced one and therefore needs further processing (Table 2). From Table 1 and Table 2, we can create a knowledge representation in the form of a decision tree (Figure 2). In Table 2, the age attributes each have a decision, for example 15-24 is infection and 45-64 is degenerative, so no further calculation is required for them; only the 25-44 attribute still needs further processing. In Table 2, both atrophy attributes have a degenerative decision, so no further calculation is required. Figure 2 shows the final decision tree, which contains the following rules:
1) When the age is "15-24 years", then the decision is infection.
2) When the age is "45-64 years", then the decision is degenerative.
3) When the age is "65+ years", then the decision is degenerative.
4) When the age is "25-44 years", then the factor is atrophy.
5) When the atrophy is "Yes", then the decision is degenerative.
6) When the atrophy is "No", then the factor is cough.
7) When the cough is "Yes", then the decision is infection.
8) When the cough is "No", then the factor is headache.
9) When the headache is "No", then the decision is degenerative.
10) When the headache is "Yes", then the factor is fever.
11) When the fever is "Yes", then the decision is infection.
12) When the fever is "No", then the factor is nausea-vomiting.
13) When nausea-vomiting is "Yes", then the decision is degenerative.
14) When nausea-vomiting is "No", then the decision is infection.
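The fourteen rules above can be written as a single nested decision function. The sketch below is one possible encoding, not the original application: it reads the first age range as 15-24 and treats the 25-44 category as leading to the atrophy test, and the dictionary keys and function name are hypothetical.

```python
def classify(patient):
    """Apply the decision rules of the tree to one patient record.

    `patient` is a dict with an "age" category and "Yes"/"No" symptom values.
    """
    age = patient["age"]
    if age == "15-24":
        return "Infection"                      # rule 1
    if age in ("45-64", "65+"):
        return "Degenerative"                   # rules 2 and 3
    if age != "25-44":
        raise ValueError(f"age category not covered by the rules: {age!r}")
    # Rule 4: for 25-44 the next factor is atrophy.
    if patient["atrophy"] == "Yes":
        return "Degenerative"                   # rule 5
    if patient["cough"] == "Yes":
        return "Infection"                      # rules 6 and 7
    if patient["headache"] == "No":
        return "Degenerative"                   # rules 8 and 9
    if patient["fever"] == "Yes":
        return "Infection"                      # rules 10 and 11
    if patient["nausea_vomiting"] == "Yes":
        return "Degenerative"                   # rules 12 and 13
    return "Infection"                          # rule 14


# Hypothetical patient, for illustration only.
sample = {"age": "25-44", "atrophy": "No", "cough": "No",
          "headache": "Yes", "fever": "No", "nausea_vomiting": "Yes"}
sample_decision = classify(sample)
```

Each `if` corresponds to one level of the tree, so the order of the tests matters: a symptom lower in the tree is only consulted when every test above it has failed to produce a decision.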

Conclusion
It can be concluded that data mining can help provide useful information for predicting the diagnosis groupings of patients; this study uses one data mining algorithm, the C4.5 algorithm. In accuracy testing with the confusion matrix method, the first test, with 85 training data and 33 test data, produced 78% accuracy with 22% error. The second test, with 115 training data and 18 test data, yielded 88% accuracy with 12% error.
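For reference, accuracy and error can be derived from a confusion matrix as sketched below. The counts in the example are hypothetical and chosen only to illustrate the arithmetic; they are not the confusion matrices of the two tests reported here, and the function name is an assumption.

```python
def accuracy_from_confusion(matrix):
    """Accuracy = correctly classified cases / all cases.

    `matrix[actual][predicted]` holds the number of test cases with that
    actual class and predicted class.
    """
    correct = sum(matrix[label][label] for label in matrix)
    total = sum(sum(row.values()) for row in matrix.values())
    return correct / total


# Hypothetical counts for ten test cases, for illustration only.
demo_matrix = {
    "Infection": {"Infection": 4, "Degenerative": 1},
    "Degenerative": {"Infection": 1, "Degenerative": 4},
}
demo_accuracy = accuracy_from_confusion(demo_matrix)
demo_error = 1 - demo_accuracy  # error is the complement of accuracy
```

The error rate is simply the share of test cases that fall off the diagonal of the matrix, which is why each reported accuracy and error pair sums to 100%.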