Formal modeling of Gene Ontology annotation predictions based on factor graphs

Gene Ontology (GO) is a hierarchical vocabulary for gene product annotation. Its synergy with machine learning classification methods has been widely used for the prediction of protein functions. Current classification methods rely on heuristic solutions to check the consistency with some aspects of the underlying GO structure. In this work we formalize the GO is-a relationship through predicate logic. Moreover, an ontology model based on Forney Factor Graph (FFG) is shown on a general fragment of Cellular Component GO.


Introduction
High-throughput sequencing technologies entail new challenges in data processing. As a result, machine learning algorithms have become relevant in many bioinformatics applications [1,2]. In particular, for Automated Functional protein Prediction (AFP) based on Gene Ontology (GO), ensemble methods take the ontological structure (a directed acyclic graph, DAG) into account through hierarchical classification [3,4]. Ensemble methods are designed in two steps: i) a set of binary classifiers is built to predict GO terms (classes); ii) the consistency of the predictions with the GO-DAG relationships is enforced. For this last step, different heuristic solutions [5,6] have been proposed, such as the True Path Rule (TPR) algorithm [5], in which the is-a DAG relationship is fulfilled by implementing the rule: "If the child term describes the gene product, then all its parent terms must also apply to that gene product".
Heuristic solutions may offer a good trade-off between accuracy and effort, but a step beyond them is to attempt a logical formalization for checking the DAG restrictions. In this paper we propose such a formalization through the Forney Factor Graph (FFG) model [7], since it allows the TPR restriction to be represented by the logical factorization of functions of several variables associated with GO terms. The resulting model, called FFG-GO, is able to infer functional predictions of genes by using the sum-product algorithm [8].
In the next Section a brief background of FFG is presented. Then, in Section 3 the TPR restrictions are formalized through predicate logic, and in Section 4 the FFG-GO model is described. In the last Section, a subgraph of Cellular Component GO is modeled by FFG-GO.

Background
An FFG is a model that represents the factorization of a function of several variables. Briefly, the FFG diagram (Fig. 1) has nodes, ordinary edges, and leaf edges, interpreted as follows: each factor f_i (also called function) is represented by a node; each state variable S_j lies between two factors and is represented by an ordinary edge; and each input variable A_k is involved in just one factor and is represented by a leaf edge. The global function f is factorized as a product of local functions, where each factor depends on a subset of the variables of f. For instance, the function f(A_1, A_2, A_3, A_4, S_1, S_2, S_3) can be factorized as a product of local factors f_i, with the structure depicted in Fig. 1.
Figure 1: FFG diagram. Boxes are factors, with f_i their functions; ordinary edges carry the state variables S_j; and leaf edges carry the input variables A_k.
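The structural rules above can be illustrated with a small sketch. The chain factorization used here is an assumption chosen for illustration, not necessarily the one drawn in Fig. 1, and the local function is a toy placeholder; the names FACTORS, check_ffg_structure, local_factor and global_function are hypothetical.

```python
from itertools import product

# Hypothetical chain factorization, assumed for illustration only:
# f = f1(A1, S1) * f2(S1, A2, S2) * f3(S2, A3, S3) * f4(S3, A4)
FACTORS = {
    "f1": ("A1", "S1"),
    "f2": ("S1", "A2", "S2"),
    "f3": ("S2", "A3", "S3"),
    "f4": ("S3", "A4"),
}

def check_ffg_structure(factors):
    """True if every S* variable joins exactly two factors and every A* exactly one."""
    counts = {}
    for variables in factors.values():
        for name in variables:
            counts[name] = counts.get(name, 0) + 1
    return all(
        count == (2 if name.startswith("S") else 1)
        for name, count in counts.items()
    )

def local_factor(*values):
    """Toy local function: 1 when all arguments agree, 0 otherwise."""
    return 1 if len(set(values)) == 1 else 0

def global_function(assignment, factors=FACTORS):
    """Evaluate f as the product of its local factors."""
    result = 1
    for variables in factors.values():
        result *= local_factor(*(assignment[name] for name in variables))
    return result

# Enumerate all binary assignments and keep those with f = 1.
NAMES = sorted({name for variables in FACTORS.values() for name in variables})
VALID = [
    bits for bits in product((0, 1), repeat=len(NAMES))
    if global_function(dict(zip(NAMES, bits))) == 1
]
```

With this toy local function the chain forces all seven variables to agree, so only the all-zeros and all-ones assignments survive in VALID.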

Formalizing GO and TPR restrictions
To include the GO constraint in the FFG model, the TPR restriction is formulated in predicate logic. The GO-DAG is denoted by G = (V, ≤), where V is a set of GO terms and ≤ is a binary relation on V. GO terms are represented by GO nodes (GO_1, ..., GO_m), where m is the cardinality of V. GO_l ≤ GO_z denotes that GO_z is a parent of GO_l, written in predicate logic as is_a(GO_l, GO_z).
The Gene Ontology Consortium has defined the TPR restriction to guarantee the consistency of the is-a GO-DAG as follows: "An annotation for a class in the hierarchy is automatically transferred to its ancestors, while unannotated genes for a class cannot be annotated for its descendants". Note that classes are GO terms and annotation is the process of assigning GO terms to gene products.
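A minimal sketch of the TPR statement follows, assuming the is-a relation is stored as (child, parent) pairs. The four-term DAG and the function names are hypothetical and serve only to illustrate the check.

```python
# Hypothetical toy DAG: each pair (child, parent) encodes is_a(child, parent).
IS_A = {
    ("GO:2", "GO:1"),
    ("GO:3", "GO:1"),
    ("GO:4", "GO:2"),
    ("GO:4", "GO:3"),
}

def ancestors(term, is_a=IS_A):
    """Return every term reachable from `term` by following is_a edges upward."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for child, parent in is_a:
            if child == current and parent not in found:
                found.add(parent)
                stack.append(parent)
    return found

def is_tpr_consistent(positive, is_a=IS_A):
    """True if every annotated term also has all of its parents annotated."""
    return all(parent in positive for child, parent in is_a if child in positive)

def propagate_up(positive, is_a=IS_A):
    """Close an annotation set under the TPR by adding every ancestor."""
    closed = set(positive)
    for term in positive:
        closed |= ancestors(term, is_a)
    return closed
```

For example, annotating only GO:4 violates the rule, while propagate_up({"GO:4"}) adds GO:1, GO:2 and GO:3 and yields a consistent annotation set.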
Based on [9], we rewrite the TPR as two rules of predicate logic: one related to the parents (Eq. 2) and the other related to the offspring (Eq. 3). In these rules, pos(GO_k) means that the gene is annotated with the term GO_k.
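One possible way to write the two rules is sketched below. It is inferred from the quoted TPR statement and is not a transcription of Eq. 2 and Eq. 3, so the quantifiers and the exact form are assumptions.

```latex
% Assumed reading of the parent rule: a positive child forces a positive parent.
\forall\, l, z:\quad \mathrm{is\_a}(GO_l, GO_z) \wedge \mathrm{pos}(GO_l) \;\Rightarrow\; \mathrm{pos}(GO_z)

% Assumed reading of the offspring rule: a negative parent forces a negative child.
\forall\, l, z:\quad \mathrm{is\_a}(GO_l, GO_z) \wedge \neg\,\mathrm{pos}(GO_z) \;\Rightarrow\; \neg\,\mathrm{pos}(GO_l)
```

Written this way, the second rule is the contrapositive of the first, which matches the two halves of the quoted TPR statement.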

FFG-GO model
Before modeling the GO behavior with an FFG, some GO concepts are briefly recalled. The gene ontology is composed of nodes representing gene functions, i.e., the k-th node and its function are jointly called GO:k. The GO nodes are connected in a DAG structure through edges that characterize the relationships between nodes. Without loss of generality, in this work we focus on the is-a relation. Fig. 2(a) illustrates a comprehensive GO example with multiple offspring and multiple parents, while Fig. 2(b) shows its FFG-GO counterpart. GO nodes are matched to the FFG model by input variables for the root and leaf nodes, and by input and state variables for inner nodes. For instance, the leaf GO:5 matches A_5, and the inner node GO:2 matches A_2, S_1, and S_2. The DAG relationships together with the GO restrictions are represented by FFG functions that depend on state and input variables. For instance, the relations r_12 and r_13 are matched to the function f_1. The strength of the FFG model is that the functions f_i are designed through logical expressions [10] that describe both the native FFG constraints and the GO-DAG restrictions. In the FFG-GO approach we identify three kinds of factors: equality, multiple offspring, and multiple inheritance. Note that equality is a native FFG factor, whereas the inclusion of multiple offspring together with multiple inheritance allows the formal implementation of the TPR.
Equality constraint: Also called the identity function and symbolized by f_=, it forces all its variables to be equal. It is required for building the FFG [7,11]. Fig. 3 shows the identity function block and Table 1 describes its truth table, where, as expected, f_= = 1 when all its variables (A_2, S_1, S_2) are equal.
Multiple offspring: Symbolized by f_∧, it describes the allowed states of a node depending on the states of its multiple children. See Fig. 4, where GO:1 matches A_1, GO:2 matches A_2 and S_1, and GO:3 matches A_3 and S_3, and Table 2. The general form of the multiple offspring constraint is:
\[
f_{\wedge}(A_k, S_1, \dots, S_n) =
\begin{cases}
1 & \text{if } \dots \wedge \neg\,\mathrm{pos}(S_j), \quad j = 1, \dots, n \\
0 & \text{otherwise}
\end{cases}
\tag{5}
\]
Here A_k is the FFG input variable that represents the parent GO node, and S_1 to S_n are the variables that represent its children.
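The two factors can be sketched as 0/1 indicator functions over binary variables. The equality factor follows the description directly. The offspring condition used here (the parent is positive, or all children are negative) is an assumption derived from the TPR statement, not a recovery of Eq. 5, and the function names are hypothetical.

```python
from itertools import product

def f_equal(*values):
    """Equality factor: 1 when all variables share the same value, else 0."""
    return 1 if len(set(values)) == 1 else 0

def f_offspring(parent, *children):
    """Assumed TPR reading: a negative parent forbids any positive child."""
    return 1 if parent or not any(children) else 0

def truth_table(factor, n_vars):
    """List (assignment, value) rows for a factor over n_vars binary variables."""
    return [(bits, factor(*bits)) for bits in product((0, 1), repeat=n_vars)]

# Rows for one parent with two children: (parent, child_1, child_2) -> value.
OFFSPRING_ROWS = truth_table(f_offspring, 3)
```

Under this assumption, the only rejected rows are those with a negative parent and at least one positive child.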
Multiple inheritance: Symbolized by f_∨, it describes the allowed states of a node depending on the states of its multiple parents. See Fig. 5, where GO:4 matches A_4 and S_6, GO:2 matches A_2 and S_2, and GO:3 matches A_3 and S_4, and Table 3. The general form of the multiple inheritance constraint is as follows:
\[
f_{\vee}(A_k, S_1, \dots, S_n) =
\begin{cases}
1 & \text{if } \dots, \quad j = 1, \dots, n \\
0 & \text{otherwise}
\end{cases}
\tag{6}
\]
Here S_1 to S_n are the variables that represent the parent GO nodes, and A_k is the input variable that represents their child.
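A matching sketch for the multiple inheritance factor is given below. The condition (the child is negative, or all parents are positive) is an assumption based on the TPR rule that a child term implies all of its parent terms; it may differ from Table 3 and Eq. 6, and the names are hypothetical.

```python
from itertools import product

def f_inheritance(child, *parents):
    """Assumed TPR reading: a positive child requires every parent to be positive."""
    return 1 if (not child) or all(parents) else 0

def allowed_states(n_parents):
    """Return the (child, parent_1, ..., parent_n) tuples accepted by the factor."""
    return [
        bits
        for bits in product((0, 1), repeat=n_parents + 1)
        if f_inheritance(*bits) == 1
    ]

# Accepted states for one child with two parents.
TWO_PARENT_STATES = allowed_states(2)
```

With two parents, this accepts every state in which the child is negative plus the single state in which the child and both parents are positive.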
In summary, the GO to FFG-GO matching for the multiple offspring-inheritance example of Fig. 2 is shown in Fig. 6.

Results and discussion
To show the strength of the proposed FFG-GO, a rich fragment of the GO Cellular Component (CC) ontology is modeled. This CC subgraph contains all the TPR restrictions discussed above, see Fig. 7. Fig. 8 shows the FFG-GO model associated with the CC fragment in Fig. 7. It shows the matching between GO CC and FFG-GO, i.e., GO terms match directly to the A_k input variables. On the one hand, the CC root GO:1 matches the A_1 input variable and is always associated with the multiple offspring function f_∧. On the other hand, the leaves (GO:9, GO:13) match the A_9 and A_13 input variables and are always associated with the multiple inheritance function f_∨. We should note that [...] requires just a redesign of the functions f, without changing the core of the FFG-GO model. Likewise, the insertion of new GO terms is simple and easy to interpret graphically. To complete the GO-term annotation prediction for each sample (protein), a dynamic inference process must take place on the FFG-GO model. This inference process is carried out by the sum-product algorithm, but its description is beyond the scope of this work; see [8] for details.
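To give a feeling for what the inference step produces, the sketch below computes per-term marginals on a tiny hypothetical fragment. It is not the sum-product algorithm of [8]: it enumerates all assignments, which is exact but only feasible for a handful of terms. The DAG, the classifier scores, and the way scores enter the model as independent evidence factors are all assumptions made for the example.

```python
from itertools import product

# Hypothetical fragment: (child, parent) pairs for the is-a relation.
IS_A = [("GO:2", "GO:1"), ("GO:3", "GO:1"), ("GO:4", "GO:2"), ("GO:4", "GO:3")]
TERMS = ["GO:1", "GO:2", "GO:3", "GO:4"]

# Hypothetical base-classifier scores P(term is positive), made up for the demo.
SCORES = {"GO:1": 0.9, "GO:2": 0.4, "GO:3": 0.7, "GO:4": 0.8}

def tpr_indicator(state):
    """1 if no positive child has a negative parent, else 0."""
    return int(all(state[parent] or not state[child] for child, parent in IS_A))

def weight(state):
    """Unnormalized global function: evidence factors times the TPR indicator."""
    value = float(tpr_indicator(state))
    for term in TERMS:
        value *= SCORES[term] if state[term] else 1.0 - SCORES[term]
    return value

def marginals():
    """Exact P(term positive | TPR holds), summing over all assignments."""
    total = 0.0
    positive = {term: 0.0 for term in TERMS}
    for bits in product((0, 1), repeat=len(TERMS)):
        state = dict(zip(TERMS, bits))
        w = weight(state)
        total += w
        for term in TERMS:
            if state[term]:
                positive[term] += w
    return {term: positive[term] / total for term in TERMS}

MARGINALS = marginals()
```

Because every consistent assignment with a positive child also has positive parents, the resulting marginal of a parent is never smaller than that of any of its children, which is the behavior the TPR is meant to guarantee.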

Conclusion
In ensemble classification methods, the GO-term predictions computed by base binary classifiers are leveraged by checking their consistency with predefined GO relationships. Both formal leveraging strategies, mainly focused on annotation precision, and heuristic alternatives, mainly focused on scalability, have been described in the literature. Focusing on formal strategies, this work has presented the formalization of the TPR restrictions by predicate logic. Further work is to embody this formalization in a hierarchical classification method based on graphical models. Throughout this paper we have focused on the is-a relationship of GO. However, this approach may be extended to other types of transitive relationships, such as part-of and has-part.