Confusion Matrices and Rough Set Data Analysis

A widespread approach in machine learning to evaluate the quality of a classifier is to cross-classify predicted and actual decision classes in a confusion matrix, also called an error matrix. A classification tool which does not assume distributional parameters but uses only information contained in the data is based on the rough set data model, which assumes that knowledge is given only up to a certain granularity. Using this assumption and the technique of confusion matrices, we define various indices and classifiers based on rough confusion matrices.


Introduction
In pattern recognition and other disciplines of machine learning, the sum of the diagonal elements of a confusion matrix is widely used to measure the success of a classification based on an algorithm or human observation in comparison with a gold standard (or "true" measurement) such as classification by an expert. The main idea is that an algorithm (or an observer) forms its own hidden equivalence classes of the data, and is forced to assign the classes to the categories given by the gold standard. The underlying model may be one of a plethora of existing techniques; see e.g. [1][2][3]. The question may be asked whether such an index is valid for determining the quality of a classifier: since we approximate sets, namely decision classes, one should use a theory of set approximation such as the rough set approach to investigate this question.
In a first step we find a connection between a rough set decision system and a resulting confusion matrix. We derive several approximations of upper and lower bounds of the classes given by the gold standard; additionally, we consider the standard indices of rough set analysis for the coverage. Owing to lack of space we shall only indicate the procedures; detailed results and proofs will appear elsewhere.

Definitions and notation
Throughout, $U$ denotes a finite nonempty set with $n$ elements. Given a set $Y = \{Y_1, \ldots, Y_k\}$ of decision classes, a classifier is a mapping $f : U \to Y$ which predicts the membership of an element of $U$ in a decision class. The predicted and true values of class membership can be cross-classified and counted in a confusion matrix. If the success of a classifier is measured by its error rate, confusion matrices may be used to analyse and to compare classifiers. A widely used confusion matrix of dimension two is shown in Table 1, and a general confusion matrix is shown in Table 2. The entry $n_{ij}$ in row $\hat{Y}_i$ and column $Y_j$ of the matrix is the number of elements of $Y_j$ which are predicted to be in $Y_i$; in particular, $\sum \{n_{ii} : 1 \le i \le k\}$ is the number of correctly classified elements.
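As an illustration of this cross-classification, the following minimal sketch counts the entries $n_{ij}$ with rows indexed by the predicted class and columns by the true class, as in the text. The function names and the six labels are hypothetical and serve only as an example.

```python
from collections import Counter

def confusion_matrix(predicted, actual, classes):
    """Return the k x k matrix n where n[i][j] is the number of elements
    whose true class is classes[j] and whose predicted class is classes[i]."""
    counts = Counter(zip(predicted, actual))
    return [[counts[(p, a)] for a in classes] for p in classes]

def correctly_classified(matrix):
    """Sum of the diagonal entries n_ii."""
    return sum(matrix[i][i] for i in range(len(matrix)))

# Hypothetical data: six elements, three decision classes.
classes = ["Y1", "Y2", "Y3"]
actual = ["Y1", "Y1", "Y2", "Y2", "Y3", "Y3"]
predicted = ["Y1", "Y2", "Y2", "Y2", "Y3", "Y1"]

matrix = confusion_matrix(predicted, actual, classes)
hits = correctly_classified(matrix)
```

Here the diagonal sum `hits` divided by the number of elements gives the usual success ratio.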
The philosophy of rough sets is based on the assumption that knowledge of the world depends on the granularity of representation [4]. Mathematically, granularity may be expressed by an equivalence relation $\theta$ on a nonempty finite set $U$, up to the classes of which membership in a subset of $U$ can be determined. For rough approximation, two operators are defined on $2^U$ in the following way: [...]
The main data type of the rough set approach is the decision system, which is closely related to a relational data table with an added decision attribute. An example is shown in Table 3; there, the object set $U$ contains six elements, there are four independent attributes, and one decision attribute $d$. For simplicity of notation, we suppose that an attribute $a$ is a mapping from $U$ to the set $V_a$ of values of $a$. Each set $Q$ of independent attributes gives rise to an equivalence relation $\theta_Q$ on $U$ by setting $x \, \theta_Q \, y$ if and only if $a(x) = a(y)$ for all $a \in Q$. Similarly, the decision attribute $d$ induces an equivalence relation $\theta_d$, the classes $Y := \{Y_1, \ldots, Y_k\}$ of which are called decision classes. We cross-classify the classes of $\theta$ with the decision classes in a granule frequency matrix, see Table 4; there, [...] Furthermore, we introduce the following parameters for each decision class $Y_i$: [...]
Consider the vector $\langle c_{ij} : 1 \le j \le k \rangle$ belonging to granule $X_i$. If this vector contains only one non-zero entry, we call the granule deterministic. In this case, $X_i \subseteq Y_j$ and prediction based on [...]
A major aim of rough set data analysis is to decide (or estimate) membership of an element $x$ of $U$ in a decision class using the knowledge given by a set $Q$ of attributes, in particular, how well the decision classes can be approximated by the knowledge obtained from a partition induced by $Q$. Note that we can define a partial classifier $f_r$ as follows: If $D = \{X_i : X_i \text{ is a deterministic class}\}$, then each $x \in D$ is correctly classified (and these are the only ones). Thus we can set $f_r(x) = x$ for all $x \in D$.
If $x \in X_i$ and $X_i$ is not deterministic, then the rough method assigns $x$ to one or more upper approximations of decision classes. In this sense, rough approximation is not a point estimate. With some abuse of language, we call $f_r$ a rough classifier.
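The notions above can be sketched in a few lines. The sketch assumes the standard rough set definitions: the lower approximation of a set is the union of all granules contained in it, and the upper approximation is the union of all granules that meet it. It also assumes that the entry $c_{ij}$ of the granule frequency matrix is $|X_i \cap Y_j|$. The decision table, the attribute names `a`, `b`, `d`, and all function names are hypothetical and are not taken from Table 3.

```python
from collections import defaultdict

def partition(universe, table, attributes):
    """Classes of theta_Q: objects that agree on every attribute in Q."""
    blocks = defaultdict(set)
    for x in universe:
        blocks[tuple(table[x][a] for a in attributes)].add(x)
    return [blocks[key] for key in sorted(blocks)]

def lower(granules, target):
    """Union of the granules contained in target (assumed standard definition)."""
    return set().union(*[g for g in granules if g <= target])

def upper(granules, target):
    """Union of the granules that intersect target (assumed standard definition)."""
    return set().union(*[g for g in granules if g & target])

def granule_frequency_matrix(granules, decision_classes):
    """Assumed entries c_ij = |X_i intersected with Y_j|."""
    return [[len(g & y) for y in decision_classes] for g in granules]

def is_deterministic(row):
    """A granule is deterministic if its row has exactly one non-zero entry."""
    return sum(1 for c in row if c != 0) == 1

# Hypothetical decision system with two independent attributes and decision d.
table = {
    1: {"a": "low", "b": "small", "d": "no"},
    2: {"a": "low", "b": "small", "d": "no"},
    3: {"a": "low", "b": "large", "d": "no"},
    4: {"a": "low", "b": "large", "d": "yes"},
    5: {"a": "high", "b": "large", "d": "yes"},
    6: {"a": "high", "b": "small", "d": "yes"},
}
universe = sorted(table)
granules = partition(universe, table, ["a", "b"])
decision_classes = partition(universe, table, ["d"])
freq = granule_frequency_matrix(granules, decision_classes)
flags = [is_deterministic(row) for row in freq]
```

In this toy system the granule `{3, 4}` is the only non-deterministic one, so the elements 3 and 4 belong to the upper approximations of both decision classes and to neither lower approximation.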
In the sequel, we suppose that $X = \{X_1, \ldots, X_m\}$ is the set of classes of a fixed equivalence relation $\theta$ on $U$, called granules, and $Y = \{Y_1, \ldots, Y_k\}$ is a set of decision classes; to avoid trivialities we assume that $k > 1$. Lower and upper approximations are taken with respect to $X$, and we shall omit the indices in the approximation functions. We shall write [...], and the sets $Z_i$ are pairwise disjoint. At times, we are only interested in whether the entry in a cell is 0 or not. To this end, we introduce an indicator function $\operatorname{Ind} : \mathbb{N} \to \{0, 1\}$ defined by $\operatorname{Ind}(n) = 0$ if $n = 0$, and $\operatorname{Ind}(n) = 1$ otherwise.
For the basic philosophy and tools of the rough set method the reader is invited to consult [5]. For recent developments and more advanced methods the overview [6] is an excellent source.

Rough confusion matrices
According to the rough set philosophy, we can only distinguish elements of $U$ up to equivalence with respect to $\theta$; hence, we must have $f(x) = f(y)$ for any classifier $f$ whenever $x$ and $y$ are in the same granule. Thus, with some abuse of language, we call a function $f : X \to Y$ a (rough) classifier. The meaning of the classifier $f$ is that each element of $X_i$ is predicted to be in $f(X_i)$. Thus, we obtain the predictor sets
$$\hat{Y}_i := \bigcup \{X_s : f(X_s) = Y_i\}.$$
If $\hat{Y}_i = \emptyset$, then no element of $U$ is predicted to be in $Y_i$ by any class $X_s$ using $f$. The (rough) confusion matrix of the classifier $f$ has dimension $k \times k$, row labels $\hat{Y}_i$, column labels $Y_j$ and, for $1 \le i, j \le k$, the entries
$$n_{ij} := |\hat{Y}_i \cap Y_j|. \tag{3.2}$$
The rough confusion matrix can be obtained in several steps:
(i) Write the granule frequency matrix $M$ obtained from $X$ and $Y$ as in Table 4.
(ii) Relabel the rows of $M$ by replacing $X_i$ with $f(X_i)$.
(iii) Aggregate the frequencies of the rows with the same label according to (3.2). If $f^{-1}(Y_j) = \emptyset$, fill the row labeled $\hat{Y}_j$ with 0s.
(iv) Sort the rows according to the indices of their labels.
The result has the form shown in Table 2.
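Steps (i) to (iv) can be sketched as follows. The sketch assumes that the granule frequency matrix is given as a list of rows (one per granule, one column per decision class) and that the classifier is encoded as a list `assignment`, where `assignment[i] = j` stands for $f(X_i) = Y_j$; both encodings and the numbers are hypothetical.

```python
def rough_confusion_matrix(freq, assignment, k):
    """Aggregate the rows of the granule frequency matrix by their label.

    freq       -- m x k granule frequency matrix (step i)
    assignment -- assignment[i] = j encodes f(X_i) = Y_j (step ii)
    k          -- number of decision classes
    """
    # Rows whose label is never used stay filled with 0s (step iii).
    matrix = [[0] * k for _ in range(k)]
    for row, label in zip(freq, assignment):
        for j, count in enumerate(row):
            matrix[label][j] += count
    # Indexing the result by label already sorts the rows (step iv).
    return matrix

# Hypothetical example: three granules, three decision classes.
freq = [
    [2, 1, 0],
    [0, 3, 1],
    [1, 0, 2],
]
assignment = [0, 2, 2]  # no granule is assigned to the second class
matrix = rough_confusion_matrix(freq, assignment, 3)
```

Since no granule is mapped to the second class in this example, its row consists of 0s only, while the column sums still equal the sizes of the decision classes.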
Example 1. We shall use the decision system of Table 3. Let $\theta$ be the equivalence relation generated by the attributes Price and Screen. The partition generated by $\theta$ has the classes [...] and the decision classes [...] The construction process is shown in Tables 5, 6, and 7. Note that $f$ classifies five of the six elements of $U$ correctly, so that its success ratio is $\frac{5}{6}$, whereas $\gamma = \frac{4}{6}$.
Table 5. The granule frequency matrix
Table 6. The relabeled matrix
Table 7. The confusion matrix
According to the rough set philosophy, the set $\mathrm{Low}(Y_i)$ approximates the diagonal set $\hat{Y}_i \cap Y_i$. The optimal approximation would be $\mathrm{Low}(Y_i) = \hat{Y}_i \cap Y_i$ with $|\hat{Y}_i \cap Y_i| = n_{ii}$; in this case, $Y_i$ is deterministic with respect to $X$. Without knowledge of the source information system, but given the resulting confusion matrix, we obtain only [...]
Two statistics are of importance in the rough set literature: The rough approximation quality is the weighted sum [...] and the accuracy of approximation of the decision class $Y_i$ is defined by the index [...] Here, $p_i := \frac{nl_i}{n_i}$ and $p_i := \frac{n_i}{nu_i}$ are precision indices [7]. The measure $\alpha_i$ is the maximal (best possible) value for the approximation quality of the set $Y_i$ of an information system which produces the observed confusion matrix.
Note that $\gamma$ and the upper bound weighted mean value of the $\alpha_i$ are linked by a strictly monotone transformation, since [...] Therefore, they are interchangeable as a measure of overall approximation quality.
The $\alpha$-accuracy is connected to the confusion matrix (and not to the underlying information system) by [...] As $\alpha$ is a weighted mean of the $\alpha_i$ and $\gamma$ is a strictly monotone function of $\alpha$, we observe that upper confusion $\gamma$ and upper confusion $\alpha$ are maximal as well.
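For orientation, the two statistics can be computed as in the following sketch, which assumes the standard rough set definitions $\gamma = \sum_i nl_i / n$ and $\alpha_i = nl_i / nu_i$, with $n_i = |Y_i|$, $nl_i = |\mathrm{Low}(Y_i)|$ and $nu_i = |\mathrm{Upp}(Y_i)|$. These may differ in detail from the confusion-matrix-based variants discussed here, and the counts below are hypothetical.

```python
from fractions import Fraction

def approximation_quality(nl, n_total):
    """gamma under the assumed standard definition: sum of |Low(Y_i)| over |U|."""
    return Fraction(sum(nl), n_total)

def accuracies(nl, nu):
    """alpha_i under the assumed standard definition: |Low(Y_i)| / |Upp(Y_i)|."""
    return [Fraction(lo, up) for lo, up in zip(nl, nu)]

# Hypothetical counts for two decision classes on a six-element universe.
n = [3, 3]   # |Y_i|
nl = [2, 2]  # |Low(Y_i)|
nu = [4, 4]  # |Upp(Y_i)|

gamma = approximation_quality(nl, sum(n))
alpha = accuracies(nl, nu)

# gamma written as a weighted sum with weights n_i / n.
weighted = sum(Fraction(n_i, sum(n)) * Fraction(nl_i, n_i) for n_i, nl_i in zip(n, nl))

# alpha_i as the product of the two precision ratios nl_i / n_i and n_i / nu_i.
products = [Fraction(lo, size) * Fraction(size, up) for lo, size, up in zip(nl, n, nu)]
```

Under these assumptions $\gamma$ is the mean of the ratios $nl_i / n_i$ weighted by $n_i / n$, and each $\alpha_i$ factors into the two precision ratios.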

Refining the rough classifier
Thus far, we have put no restrictions on the classifier function $f$. In order to bring the concept closer to rough sets, and to use more of the available information, we shall suppose in the sequel that a rough classifier satisfies the condition
$$X_i \cap f(X_i) \neq \emptyset \quad \text{for all } 1 \le i \le m. \tag{4.1}$$
Condition (4.1) implies that at least one element of $X_i$ is classified correctly by $f$. Furthermore, [...]
Our first task is to approximate $nl_j = |\mathrm{Low}(Y_j)|$. To this end, we first consider $n^*_j := n_{jj}$. The cell $n_{jj}$ counts, in particular, the cardinality of the deterministic granules contained in $Y_j$, and thus, $nl_j \le n^*_j$. We can further remove certain entries, and define $nl^{**}_j := n_{jj} - \operatorname{Ind}\bigl(\sum_{i \ne j} n_{ji}\bigr)$. Using Lemma 4.1 it is not hard, if somewhat tedious, to show the relationships among these indices: [...]
Not all of these inequalities need to hold if $f$ does not satisfy (4.1).
Turning to upper approximations, we first observe that (4.1) is equivalent to $X_i \subseteq \mathrm{Upp}(f(X_i))$ by (2.2), and thus, $\hat{Y}_j$ is a lower bound of the rough upper approximation of $Y_j$, i.e. $|\hat{Y}_j| \le nu_j$. This can be sharpened as follows: Set $nu^*_j := \sum_i n_{ji} + \sum_{i \ne j} n_{ij}$. A moment's reflection shows that $\sum_i n_{ji}$ adds all the cells in the partial granule frequency matrix spanned by the rows $X_i$ where $f(X_i) = Y_j$, and $\sum_{i \ne j} n_{ij}$ adds the entries $c_{ij}$, where $X_i \cap Y_j \neq \emptyset$ and $f(X_i) \neq Y_j$.
If $n_{ij} \neq 0$, then $n_{ii} \neq 0$ by Lemma 4.1, and therefore, there is some $X_s$ such that $f(X_s) = Y_i$ and $X_s \cap Y_j \neq \emptyset$, i.e. $X_s \subseteq \mathrm{Upp}(Y_j)$. Therefore, if $n_{ij} \neq 0$, there is at least one additional element which is in $\mathrm{Upp}(Y_j)$. Hence, we obtain a sharper bound by setting $nu^{**}_j := nu^*_j + \sum_{i \ne j} \operatorname{Ind}(n_{ij})$. Altogether, this leads to the following result: [...]
Arguably, the simplest classifier that satisfies (4.1) is a maximal row classifier $f_{mrc}$ defined as follows: Consider a granule frequency matrix as shown in Table 4. For each $1 \le i \le m$ choose some $1 \le j \le k$ such that $c_{ij}$ is maximal in $\{c_{i1}, \ldots, c_{ik}\}$. Such a $j$ always exists, but the choice need not be unique. Then, set $f_{mrc}(X_i) := Y_j$. The classifier $f_{mrc}$ satisfies (4.1), and it is well compatible with the rough set philosophy in using only information supplied by the data.
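A maximal row classifier can be sketched as below. Since the choice of a maximal column need not be unique, the sketch breaks ties by taking the first maximal column; this tie rule, the function name, and the example matrix are assumptions made for illustration only.

```python
def maximal_row_classifier(freq):
    """For each granule (row), return the index of a maximal entry.

    Ties are broken in favour of the smallest column index, which is one
    admissible choice among possibly several.
    """
    return [max(range(len(row)), key=row.__getitem__) for row in freq]

# Hypothetical granule frequency matrix: four granules, two decision classes.
freq = [
    [0, 1],
    [0, 1],
    [1, 1],  # tie: either class would be an admissible choice
    [2, 0],
]
assignment = maximal_row_classifier(freq)

# Each granule is nonempty, so its chosen cell is non-zero.
chosen_cells = [row[j] for row, j in zip(freq, assignment)]
```

Because every granule is nonempty, the chosen cell of each row is positive, so at least one element of each granule is classified correctly.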
By definition, $X_i \subseteq \hat{Y}_j$ implies that $c_{ij}$ is a maximum in row $i$. We can use this observation to establish an even sharper upper bound of $nl_j$: Suppose that $\hat{Y}_j = \{X_{s_1}, \ldots, X_{s_p}\}$, and consider the partial granule matrix [...] Since a maximum of each row is in column $Y_j$, it follows that $n_{jt} \le n_{jj}$ for all $1 \le t \le k$, and therefore, $\max\{n_{jt} : 1 \le t \le k, t \neq j\} \le n_{jj}$. Setting $nl^m_j := n_{jj} - \max\{n_{jt} : 1 \le t \le k, t \neq j\} \ge nl_j$ we obtain
Theorem 4.3. $nl_j \le nl^m_j \le nl^{**}_j$ for all $1 \le j \le k$.
Finally, we estimate the rough upper bound of $Y_j$ using $f_{mrc}$. Setting $nu^m_j := n_{jj} + \sum_{i \ne j} (n_{ji} + 2 \cdot n_{ij})$, it can be shown that
Theorem 4.4. $nl^{**}_j \le nu^m_j \le nu_j$ for all $1 \le j \le k$.
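The indices $nl^{**}_j$, $nl^m_j$ and $nu^m_j$ can be checked numerically on a small example. The sketch below assumes that $\operatorname{Ind}(n)$ is 1 exactly when $n \neq 0$, that the granule frequency entries are $c_{ij} = |X_i \cap Y_j|$, and that ties in the maximal row classifier go to the first maximal column; the matrix and all function names are hypothetical.

```python
def ind(n):
    """Indicator, assumed to be 1 for a non-zero cell and 0 otherwise."""
    return 0 if n == 0 else 1

def mrc_confusion(freq):
    """Confusion matrix of a maximal row classifier (first maximum wins)."""
    k = len(freq[0])
    n = [[0] * k for _ in range(k)]
    for row in freq:
        label = max(range(k), key=row.__getitem__)
        for j, c in enumerate(row):
            n[label][j] += c
    return n

def bounds(n):
    """Return the lists nl**_j, nl^m_j and nu^m_j computed from n."""
    k = len(n)
    others = lambda j: [i for i in range(k) if i != j]
    nl2 = [n[j][j] - ind(sum(n[j][i] for i in others(j))) for j in range(k)]
    nlm = [n[j][j] - max(n[j][t] for t in others(j)) for j in range(k)]
    num = [n[j][j] + sum(n[j][i] + 2 * n[i][j] for i in others(j)) for j in range(k)]
    return nl2, nlm, num

def true_sizes(freq):
    """|Low(Y_j)| and |Upp(Y_j)| read off the granule frequency matrix."""
    k = len(freq[0])
    nl = [sum(sum(r) for r in freq if r[j] > 0 and sum(r) == r[j]) for j in range(k)]
    nu = [sum(sum(r) for r in freq if r[j] > 0) for j in range(k)]
    return nl, nu

# Hypothetical granule frequency matrix: five granules, three decision classes.
freq = [[3, 0, 0], [2, 1, 0], [0, 2, 2], [0, 0, 4], [1, 0, 2]]
n = mrc_confusion(freq)
nl2, nlm, num = bounds(n)
nl, nu = true_sizes(freq)
```

On this example the chain $nl_j \le nl^m_j \le nl^{**}_j$ and the bound $nu^m_j \le nu_j$ hold for every class, which is consistent with the stated results but is of course no substitute for the proofs.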

Conclusion and outlook
In this note, we have explored a connection between rough set approximation and confusion matrices, and have presented several natural indices that approximate the lower and upper bounds given by the reference standard.Owing to lack of space, we have only indicated the procedures with respect to one observer.
The next step will be to broaden the investigation to two or more observers: Each of these has internal sets $X$ and $X'$ of granules which need to be reconciled to a common standard. This is related to interrater reliability, which is a common technique used in psychology (and AI) to gauge agreement among experts. We shall also re-interpret common statistics of rough set analysis based on rough confusion matrices. This will, in some sense, complement our earlier research on precision indices in the rough set framework [8].

Table 1. A 2-class confusion matrix

Table 2. A general confusion matrix

Table 3. A decision system

Table 4. A granule frequency matrix