Decision tree and bagging algorithm for the automatic identification of epithelial cells in wounds

In this paper, a pattern recognition algorithm for the automatic identification of epithelial cells was investigated. A decision tree and the bagging algorithm were used to find the most important features for the automatic identification of epithelial cells in wounds. Different feature sets were used to evaluate identification performance, and three features emerged as the most important. The method and algorithm can also be applied to other automatic cell recognition tasks.


Introduction
The motion of epithelial cells is very important to wound healing. Analysis and mathematical modeling of microscopic images provide an effective tool for studying it. By analyzing the motion parameters of a type of epithelial cell in a specific environment and building a motion model, we can understand the mechanics of wound healing quantitatively. The prerequisite for all of this is to find the most distinguishing characteristics and to identify the epithelial cells against their background.

Cell culture and drug treatment
HCT116 human epithelial cell line obtained from American Type Culture Collection (Manassas, USA) was grown in RPMI 1640 medium (Invitrogen, USA) supplemented with 10% (v/v) fetal bovine serum (FBS; Invitrogen) and 1% penicillin-streptomycin (Welgene, Korea) in a 37°C humidified incubator in an atmosphere of 5% CO2. Drug treatment of cells was performed by adding 50 µM oxaliplatin (L-OHP; Boryung Pharmaceutical, Korea) or vehicle (for the control sample) to the culture medium and incubating for 48 h.

Hoechst 33342 (HO) staining
Nuclear condensation and fragmentation are the best-known features of dead epithelial cells, and they are also very simple and easy to detect. Staining the nuclei to observe these alterations is therefore a well-established and widely used method of cell detection. In our experiments, the cells were incubated with 1 µg/ml HO for the final 10 min of drug treatment in the 37°C incubator, and then both floating and attached cells were collected by centrifugation. The pooled cell pellets were washed with ice-cold phosphate-buffered saline (PBS), fixed in 3.7% formaldehyde on ice, washed again with PBS, and a fraction of the suspension was centrifuged in a cytospinner (Thermo Shandon, Pittsburgh, PA).

Experimental images
The original captured image size is 2560×1920. Fig.1 shows the light image and the Hoechst image used in our experiment. In Fig.1 (b), it is easy to see that live epithelial cells are of low brightness, large size and circular shape, while cells killed by the wound are of high brightness, tiny size and near-circular shape.

Feature extraction and analysis
In the normal framework of an image-based cell recognition system, image preprocessing and image segmentation are the two initial steps. However, they will not be elaborated in this paper; we focus only on the features. A good feature should remain unchanged under variations within a class, and it should reveal important differences when discriminating between patterns of different classes. In other words, patterns should be described with as little loss of pertinent information as possible.
In these experiments, 10 features were estimated automatically for each case from illumination intensity and from morphological and textural nuclear features [1]. Information about nuclear size and shape was captured by the morphological features, which constituted measurements of nuclear area, roundness and concavity [1]. Concavity attempts to measure the severity of concavities, i.e. the indentations of a nucleus [1]. The remaining two textural features encoded the chromatin distribution of the cell nucleus; they were estimated by means of nuclear histograms and the co-occurrence matrix [2]. Table 1 lists some typical images of normal epithelial cells and dead cells together with their respective feature values. Ten features were used in our experiments, and a description of each feature is listed in Table 2.
Fig.1 The reference light image (a) and the Hoechst image (b)
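As a concrete illustration of the morphological features above, the isoperimetric ratio is one common way to compute a roundness value from a segmented nucleus. The paper does not state the exact formula used, so the definition below is an assumption; the sketch is in Python, although the experiments themselves were run in Matlab.

```python
import math

def roundness(area, perimeter):
    """Isoperimetric roundness: 1.0 for a perfect circle, smaller for
    elongated or irregular shapes. One common definition of a
    'roundness' feature (assumed, not taken from the paper)."""
    return 4.0 * math.pi * area / (perimeter ** 2)

# A circle of radius r has area pi*r^2 and perimeter 2*pi*r -> roundness 1.
r = 5.0
print(round(roundness(math.pi * r ** 2, 2 * math.pi * r), 3))  # 1.0
```

Any deviation from a circular outline lowers the value, which is why such a measure helps separate the circular live nuclei from fragmented ones.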

Pattern recognition research
In this paper, the importance of all features and the classification accuracy were evaluated. The classification algorithm used here is the decision tree algorithm.

Decision tree algorithm
The decision tree algorithm has been used broadly for many years. It approximates discrete-valued functions, can yield many useful expressions, and is one of the most important methods for classification. Its terminology follows the "tree" metaphor. A tree has a root, which is the first split point of the data attributes when building the tree. It also has leaves, so that every path from root to leaf forms a rule that is easily understood. Since the decision tree is built from given data, the values and character of the data matter: the amount of data affects the result of the tree-building procedure, and the types of the attribute values affect the tree model. Decision trees need two kinds of data: training and testing. The training data, usually the larger part, are used for constructing the trees; the more training data collected, the higher the accuracy of the results. The testing data are used to obtain the accuracy rate and misclassification rate of the decision tree. Many decision tree algorithms have been developed. One of the most famous is ID3 [3,4], whose choice of split attribute is based on information entropy. C4.5 is an extension of ID3 [5] that improves computing efficiency, deals with continuous values, handles attributes with missing values, avoids over-fitting, and adds other functions.
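To make ID3's entropy-based split criterion concrete, here is a minimal pure-Python sketch of information gain: the reduction in label entropy achieved by partitioning the data on one attribute. The function names are illustrative; the paper's experiments used Matlab's tree implementation.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    """Entropy reduction achieved by splitting `labels` into `groups`
    (a partition of the labels by one attribute's values) --
    the quantity ID3 maximizes when choosing a split attribute."""
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - remainder

# Toy example: a perfect split removes all uncertainty (gain = 1 bit).
labels = ["live", "live", "dead", "dead"]
print(information_gain(labels, [["live", "live"], ["dead", "dead"]]))  # 1.0
```

ID3 evaluates this gain for every candidate attribute at each node and splits on the attribute with the largest value.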

Bagging algorithm
In data mining, an approach to making decisions more reliable is to combine the outputs of different models. Several machine learning techniques do this by learning an ensemble of models and using them in combination; prominent among these is a scheme called "bagging" [6].
The bagging predictor is a method for generating multiple versions of a predictor and using them to obtain an aggregated predictor. The aggregation averages over the versions when predicting a numerical outcome and takes a plurality vote when predicting a class. The multiple versions are created by making bootstrap replicates of the learning set and using them as new learning sets. Tests on real and simulated data sets, using classification and regression trees and subset selection in linear regression, have shown that bagging can provide substantial gains in accuracy [7].
Bagging attempts to neutralize the instability of learning methods by simulating the process on a given training set. Instead of sampling a fresh, independent training dataset each time, the original training data are altered: instances are randomly sampled with replacement from the original dataset to create a new one of the same size. This sampling procedure inevitably replicates some of the instances and leaves out others.
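The two ingredients just described, bootstrap resampling of the learning set and plurality voting over the ensemble's class predictions, can be sketched in a few lines of Python (illustrative names; the paper's experiments were run in Matlab):

```python
import random
from collections import Counter

def bootstrap_replicate(data, rng):
    """Sample len(data) instances with replacement: some originals are
    replicated, others are left out of the new learning set entirely."""
    return [rng.choice(data) for _ in data]

def plurality_vote(predictions):
    """Aggregate the class labels predicted by the ensemble members."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
learning_set = list(range(20))
replica = bootstrap_replicate(learning_set, rng)
print(len(replica) == len(learning_set))          # True: same size as original
print(plurality_vote(["live", "dead", "live"]))   # live
```

Each tree in the ensemble is trained on its own replica, and a test point's final class is the vote over all trees.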

Data description
The dataset includes 10 features extracted from 25,000 cells in 25 microscopic images, giving 10 × 25,000 feature values in total.

Experiments
In order to compare with the experimental results of the bagging decision tree algorithm, two other classification methods were also applied: the plain decision tree algorithm and bagging Naïve Bayes. All experiments were implemented on the Matlab 2011b platform using the Matlab statistics toolbox. The first step in constructing the classification ensemble is to find a good leaf size for the individual trees.
Three different leaf sizes of 4, 6 and 10 were tried, with 30 trees. For reproducibility and fair comparison, the random number generator, which is used to sample with replacement from the data each time a classifier is built, was reinitialized. The errors are comparable for the three leaf-size options. A leaf size of 10 was used because it results in leaner trees and more efficient computation. We did not split the data into training and test subsets here; this is done internally, implicit in the sampling procedure that underlies the method. At each bootstrap iteration, the bootstrap replica is the training set, and any instances left out ("out-of-bag") are used as test points to estimate the out-of-bag classification error.
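The out-of-bag bookkeeping described above can be sketched as follows (a Python illustration with assumed names; the paper used Matlab's tree-ensemble facilities). On average about 1 − (1 − 1/n)^n ≈ 36.8% of the instances fall outside each bootstrap replica and are available as free test points:

```python
import random

def bootstrap_split(n, rng):
    """Return the in-bag index list of one bootstrap replica and the
    sorted out-of-bag indices left over for error estimation."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(in_bag))
    return in_bag, out_of_bag

rng = random.Random(0)
in_bag, oob = bootstrap_split(1000, rng)
print(len(in_bag))                 # 1000: the replica keeps the original size
print(round(len(oob) / 1000, 2))   # close to 1 - 1/e ~ 0.37
```

Averaging each instance's error over the trees for which it was out-of-bag gives the ensemble's out-of-bag error estimate without a separate test set.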
Next, we investigated whether all the features are important for the accuracy of the classifier, by turning on the feature importance measure and plotting the results to find the most important features visually. It is easy to see that features 3 (MedianIntensity), 7 (MinorAxisLength) and 8 (Area) stand out from the rest, while features 4 (Eccentricity), 9 (Contrast) and 10 (Entropy) have the least ability to separate dead epithelial cells from normal live ones. Generally speaking, the illumination and size features have the most separating ability while the texture features have less, which matches the intuitive impression that dead and live epithelial cells give.
The following figure compares the classification results using all features with those using only the 3 most contributing features. It is evident that when the number of trees is not very large, the classification results using three features are comparable to those using all ten features.
(a) The feature importance graph. (b) Classification error using the most important features.

Performance measurement
Many different metrics are used in machine learning and data mining to build and evaluate models. We employed three performance measures: precision, recall and F-measure. Informally, precision measures the percentage of actual live cells among the cells that were classified as live, recall measures the percentage of actual live cells that were discovered, and F-measure balances precision against recall. According to the feature importance results, MedianIntensity, MinorAxisLength and Area are the three features with the best classification ability. We ran classification experiments using these three features and the remaining features respectively, with 320 training samples and 80 test samples. Table 3 shows the performance results of the three methods.
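The three measures are computed from confusion-matrix counts; a minimal Python sketch, treating "live cell" as the positive class. The counts below are made up for illustration and are not taken from Table 3.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-measure from true-positive, false-positive
    and false-negative counts for the positive ('live cell') class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical counts on an 80-sample test set:
p, r, f = precision_recall_f(tp=70, fp=5, fn=5)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.933 0.933 0.933
```

When precision and recall are equal, the F-measure (their harmonic mean) coincides with them; otherwise it is pulled toward the smaller of the two.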

Conclusions
In this paper, a decision tree with the bagging algorithm was used to find the most important features for the automatic identification of live and dead epithelial cells. The classification performance of different feature groups was also compared. MedianIntensity, MinorAxisLength and Area are the three most important features for the automatic identification, giving the best identification performance. We believe the method and algorithm can also be used in other similar applications.