Sundanese ancient manuscripts search engine using probability approach

Today, Information and Communication Technology (ICT) touches every aspect of life, including culture and heritage. Sundanese ancient manuscripts, part of the Sundanese heritage, are in a damaged condition, and so is the information they contain. To preserve the information in these manuscripts and make it easier to search, a search engine has been developed. The search engine must have good computing ability. To obtain the best computation in the developed search engine, three probabilistic approaches were compared in this study: the Bayesian Network Model (BNM), Divergence from Randomness with the PL2 distribution (DFR-PL2), and DFR-PL2F, a derivative of DFR-PL2. The three approaches are supported by a document index and three different weighting methods: term occurrence, term frequency, and TF-IDF. The experiment involved 12 Sundanese ancient manuscripts containing 474 distinct terms. The developed search engine was tested with 50 random queries for three types of query. The experimental results showed that, for both single and multiple queries, the best search performance was given by the combination of the PL2F approach and the TF-IDF weighting method. The performance was evaluated using the average response time, with a value of about 0.08 seconds, and a Mean Average Precision (MAP) of about 0.33.


Research Methods
There are two major phases in this study: building the collection and building the Sundanese ancient manuscript retrieval system. The two activities are described in the following sections.

Build The Collection
The build-the-collection phase consists of three steps: data acquisition, transliteration, and segmentation with annotation. The three steps are described in detail below.

Data Acquisition
The Sundanese ancient manuscripts were acquired at Situs Kabuyutan Ciburuy, Garut, West Java. Because the documents are fragile, we cannot hold them for a long time. Images of the documents were therefore acquired to make them easier to process in the next step.

Transliteration
After the images of the Sundanese ancient manuscripts were obtained, the documents were transliterated by philologists. Three philologists read the images of the manuscripts and then rewrote the content in Latin letters. The result of this step is a set of Sundanese documents containing the content of the Sundanese ancient manuscripts.

Segmentation and Annotation
The segmentation and annotation step was done in collaboration between the researchers and the philologists. The philologists manually segmented and annotated, word by word, the documents obtained from the transliteration step. The results were then converted into digital segmentation and annotation by the researchers using Aletheia [2]. The outputs of this step are the segmented words, their annotation in Latin letters, and the coordinates of the words in the documents. Figure 1 shows the framework of the Sundanese ancient manuscript search engine. The collection of Sundanese ancient manuscripts in their Latin version is processed in the term-weighting phase. Three term-weighting methods are used in this study: term occurrence (to), term frequency (tf), and term frequency-inverse document frequency (tf-idf). After the term weights are calculated, the set of queries is run in three different probabilistic methods. Each probabilistic method is set with the parameter values that give the best performance according to best practices; these values are listed in the Experiments and Results section. The search results are delivered in an interactive interface, and the mean average precision of the three methods is compared.
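The three weighting methods above can be sketched as follows. This is a minimal illustration, not the authors' implementation; documents are assumed to be pre-tokenized lists of Latin-letter terms, and the natural-log IDF form is an assumption since the paper does not give the formula here.

```python
import math
from collections import Counter

def term_weights(docs):
    """Compute three weights for each (term, document) pair:
    term occurrence (to), term frequency (tf), and tf-idf."""
    n = len(docs)
    counts = [Counter(d) for d in docs]
    # document frequency: in how many documents each term appears
    df = Counter()
    for c in counts:
        df.update(c.keys())
    weights = []
    for c in counts:
        w = {}
        for t, f in c.items():
            to = 1                      # term occurs in this document
            tf = f                      # raw count of the term
            idf = math.log(n / df[t])   # rarer terms weigh more (assumed log base)
            w[t] = {"to": to, "tf": tf, "tfidf": tf * idf}
        weights.append(w)
    return weights
```

For example, a term that appears in every document gets a tf-idf weight of zero, while a term concentrated in one document keeps its full frequency-scaled weight.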

Bayesian Network Model
Based on reference [3], the BNM is described as a directed acyclic graph (DAG) whose nodes represent random variables. The random variables used in this study are the documents, the terms, and the inference results. The arcs in the graph express causal relationships between variables, called the queries. Conditional probabilities describe the strength of these dependency relationships. The BNM is a probabilistic model that provides a clean formalism for combining multiple sources of evidence about a document. The sources of evidence used are past queries, past feedback cycles, and distinct query formulations. Every search result the system returns for a query is saved and used again in the next search process that involves a similar query. Let t be a y-dimensional vector defined by t = (t_1, t_2, ..., t_y), where t_1, t_2, ..., t_y are binary random variables t_i ∈ {0, 1}. These variables define the 2^y possible states of t. Let d_j be a binary random variable associated with a document in the collection, and let q be a binary random variable associated with the user query. The inference can be calculated by (1). The term-query beliefs P(q|t) can be obtained from (2); a reasonable default value is w_q = 1.
The prior probability P(d_j) of observing a document d_j can be set to 1/N, where N is the total number of documents in the collection.
The document-term beliefs can be obtained from (4) below.
Here f̄_{i,j} is the normalized term frequency, given by f̄_{i,j} = f_{i,j} / max_k f_{k,j}, and IDF_i is the inverse document frequency variable given by (5). Based on empirical evidence, the value of α is 0.4. The final step is combining the evidence, which is done by retrieving all past queries that are relevant to the new one. The ranking provided by the inference network is obtained by calculating equation (6).
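A sketch of this document-term scoring follows. Since equations (1)-(6) are not reproduced in the text, the exact combination of evidence is an assumption here: the function name bnm_scores, the plain sum over query terms, and the standard log(N/n_i) form of IDF are illustrative choices, not the authors' exact formulation. Only the smoothing constant α = 0.4, the normalized term frequency, and the uniform prior 1/N come from the text.

```python
import math
from collections import Counter

ALPHA = 0.4  # smoothing constant from the text

def bnm_scores(docs, query):
    """Score documents with tf-idf style document-term beliefs:
    normalized tf smoothed with alpha = 0.4, times IDF, combined
    by a simple sum over query terms (an assumption), and scaled
    by the uniform document prior 1/N."""
    n = len(docs)
    counts = [Counter(d) for d in docs]
    df = Counter()
    for c in counts:
        df.update(c.keys())
    scores = []
    for c in counts:
        max_f = max(c.values())
        s = 0.0
        for t in query:
            if t in c:
                ntf = c[t] / max_f                      # normalized term frequency
                idf = math.log(n / df[t])               # assumed form of Eq. (5)
                s += (ALPHA + (1 - ALPHA) * ntf) * idf  # assumed smoothed belief
        scores.append(s / n)                            # uniform prior P(d_j) = 1/N
    return scores
```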

Divergence From Randomness with PL2 Distribution
Based on reference [4], the idea of the Divergence From Randomness (DFR) method is to compute term weights by measuring the divergence between the term distribution produced by a random process and the actual term distribution. A word that carries little information is generally scattered across the collection. Calling the collection C, a term t_i has a probability distribution P(t_i|C) with respect to C, and the amount of information it carries is -log P(t_i|C). The set of documents in which a term occurs is referred to as its elite set, and the probability of the term within a document is given by P(t_i|d_j). The smaller the probability of occurrence of t_i, the more informative the term. Many families of probability distributions can be implemented in the DFR framework. One distribution that can be implemented is PL2. Based on [5], the indexing weight w_{ij} of a query term in document d_j is calculated with (7). Prob1_{ij} can be replaced by the PL2 distribution, the Poisson model with Laplace after-effect and Normalisation 2; the formula of the PL2 distribution is given in (8).
The collection frequency F indicates the number of occurrences of term t_i in the collection. Prob2_{ij} can be replaced by the normalized term frequency, which is calculated with equation (9).
Here c is a constant with value 1.5, mean_dl is the average document length, N is the number of documents in the corpus, and l_j is the length (the number of indexing terms) of document d_j in the collection.
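The PL2 weight of a single term in a single document can be sketched as below. This follows the standard PL2 formulation (Poisson model with Laplace after-effect and Normalisation 2) under the assumption that equations (7)-(9) match that standard form; the function name and argument names are illustrative.

```python
import math

def pl2_weight(tf, doc_len, avg_dl, term_coll_freq, n_docs, c=1.5):
    """PL2 weight of one term in one document.
    c = 1.5 as in the text; lam = F / N is the expected term
    frequency under the random (Poisson) model."""
    # Normalisation 2: rescale tf by the document length
    tfn = tf * math.log2(1 + c * avg_dl / doc_len)
    lam = term_coll_freq / n_docs
    # Poisson divergence with Laplace after-effect (standard PL2 form)
    return (1.0 / (tfn + 1)) * (
        tfn * math.log2(tfn / lam)
        + (lam - tfn) * math.log2(math.e)
        + 0.5 * math.log2(2 * math.pi * tfn)
    )
```

Note that the leading 1/(tfn + 1) factor is the Laplace after-effect: it damps the score growth for terms that already occur many times in the document.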

Divergence From Randomness with PL2F Distribution
Another probability distribution in DFR, adopted from Craig, Christina, and Iadh in [1], is PL2F, a derivative of PL2. The relevance score of a document d_j for the query terms is given by (10):

score(d_j, q) = Σ_{t∈q} qtw · (1/(tfn + 1)) · (tfn · log2(tfn/λ) + (λ − tfn) · log2 e + 0.5 · log2(2π · tfn))

where λ is given by F/N, F is the frequency of the query term t in the collection, and N is the number of documents in the whole collection. The query term weight qtw is given by qtf/qtf_max, where qtf is the frequency of the query term and qtf_max is the maximum frequency among the query terms. tfn corresponds to the normalized term frequency, which is given by (11):

tfn = Σ_f w_f · tf_f · log2(1 + c_f · avg_l_f / l_f)

where tf_f is the frequency of the term t in field f of document d_j, avg_l_f is the average length of field f over the whole collection, and l_f is the length in tokens of field f of document d_j. The hyper-parameters w_f and c_f control the contribution of each field. Given the condition of the Sundanese ancient manuscripts in this study, the only field used for both parameters is the body of the document, with values c = 4.10 and w = 1.
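With a single field (the body), the PL2F score of one document reduces to the sketch below. The function and argument names are illustrative, and the single-field simplification mirrors the setting in the text (c = 4.10, w = 1); this is not the authors' implementation.

```python
import math

def pl2f_score(query_tfs, body_tfs, avg_body_len, body_len,
               coll_freqs, n_docs, w_f=1.0, c_f=4.10):
    """PL2F relevance score of one document for a query, using only
    the body field with the hyper-parameters from the text.
    query_tfs:  {term: frequency in the query}
    body_tfs:   {term: frequency in the document body}
    coll_freqs: {term: frequency F in the whole collection}"""
    qtf_max = max(query_tfs.values())
    score = 0.0
    for t, qtf in query_tfs.items():
        tf_f = body_tfs.get(t, 0)
        if tf_f == 0:
            continue  # term absent from the document: no contribution
        # Normalisation 2F over the single (body) field
        tfn = w_f * tf_f * math.log2(1 + c_f * avg_body_len / body_len)
        lam = coll_freqs[t] / n_docs
        qtw = qtf / qtf_max  # query term weight qtf / qtf_max
        score += qtw * (1.0 / (tfn + 1)) * (
            tfn * math.log2(tfn / lam)
            + (lam - tfn) * math.log2(math.e)
            + 0.5 * math.log2(2 * math.pi * tfn)
        )
    return score
```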

Experiments and Results
The collection of Sundanese terms obtained from the Sundanese ancient manuscripts is described in Table 1.
Once the term-weighting process was done and each term had its own weight, sets of queries were randomly picked and run through the three probabilistic approaches. The parameter values for the three probabilistic methods are shown in Table 2. The search results are shown in an interactive interface in the browser; an example of the search results can be seen in Figure 2. The fastest average response time was given by DFR-PL2F and the slowest by BNM. The ratio of the query latency of the three methods is 16:1:1 for BNM, PL2, and PL2F respectively. The complete average response times, which indicate the query latency, can be seen in Table 3. Based on the experiment, there is no difference in mean average precision among the three probabilistic methods for single-term queries: the 50 queries run in the system gave a MAP value of about 0.34. A difference in mean average precision appears when the proposed system is tested with two-term queries. Although the three probabilistic methods show a deviation in MAP value, the best performance is given by DFR-PL2F, with 0.22 for 50 queries. Figure 3 shows the MAP values for the two-term queries applied in the proposed system.
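The MAP evaluation used above can be computed as follows. This is the standard definition of average precision and its mean over queries; the function names are illustrative, and ranked results are assumed to be lists of document identifiers with a known set of relevant documents per query.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked result list: the mean of the
    precision values at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / i  # precision at this relevant rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP over (ranked_list, relevant_set) pairs, one pair per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

For example, retrieving a relevant document at rank 1 and a second one at rank 3, with two relevant documents in total, gives an average precision of (1/1 + 2/3)/2 ≈ 0.83.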