Modelling and Simulation of Search Engine

The search engine is currently the most widely used tool for accessing information. The information space, however, has its own behaviour. To identify its characteristics systematically, the information space needs to be described mathematically. This paper reveals some characteristics of search engines based on a model of a document collection, and then estimates their impact on the feasibility of information. We derive these characteristics from a lemma and a theorem about singletons and doubletons, and then compute the characteristics statistically by simulating the possible use of search engines, in this case Google and Yahoo. The two search engines behave differently, although in theory both are based on the concept of a document collection.


Introduction
To access or search for information in an information space or system, we need tools [1]. One such tool is the search engine, which is a software system [2,3]. In general, to know and understand a system we build a model of it, so that the model represents the search engine mathematically [4]. Simulation, in turn, can be used to estimate the effect of the search engine model on the information space or system [5].
There are many different search engines: some arose naturally with databases, while others grew up with the web (web search engines) [6,7]. Faced with the complexity of information, some search engines become helpless and disappear, some shift to meet the required capabilities, and some change their appearance and re-emerge as new ones. All of this affects access to information in the space. Here, mathematical principles are used not only to systematize but also to optimize the creation of a search engine on the information space. This paper aims to express the characteristics of search engines based on the constraints in the information space.
Suppose we denote the information space or system by Ω [8]. The information space contains groups of documents D [9]. Each group of documents consists of documents dj containing words w, where a word is the basic unit of discrete data, defined to be an item from a vocabulary indexed by {1,…,K}, with wk = 1 if k in K and wk = 0 otherwise [10,11]. Next, we define the terms related to words.
Definition 1. A term tx consists of at least one word, i.e. tx = (wl|l=1,…,L), k ≤ l, where l is the number of parameters representing words w, l is the number of vocabulary items in tx, and |tx| = l is the size of tx.

Definition 2.
Suppose Ω is a set of documents indexed by a search engine, i.e., a set consisting of ordered pairs of terms tli and documents dlj, or (tli,dlj), i=1,…,I, j=1,…,J. The relation table with two columns tl and dl is a representation of the search engine, whereby Ωl = {(tl,dl)ij} is a subset of Ω. The size of Ω is denoted by |Ω|.
Definition 3. Let tl be a search term and q a query; then tl in q for tl in dl, dl in Ω.
In terms of logical implication, Definition 3 expresses that a document is relevant to a query if it implies the query, that is, if d=>q is true or d=>tl is true for all d in Ω: (d=>tl) = 1. Thus, the degree of d=>q is measured by P(d=>q). Therefore, there is a uniform mass probability function for Ω.
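For a finite Ω, a uniform mass probability function is ordinarily written as follows. The notation here is an assumption and may differ from the form used in the source's own equation.

```latex
P : \Omega \to [0,1], \qquad
P(d) = \frac{1}{|\Omega|} \quad \text{for every } d \in \Omega, \qquad
\sum_{d \in \Omega} P(d) = 1 .
```

Under this assumption, the probability of any subset of Ω is its cardinality divided by |Ω|.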

Definition 4.
Suppose tx is a search term, tx in S, whereby S is a set of singleton search terms of a search engine. A vector space Ωx, a subset of Ω, is a singleton search engine event (singleton) of documents that contain an occurrence of tx in dx.
Equivalently, Ωx is the subset of Ω for which d=>tx is true: Ωx(tx)≈1 if tx is true at d in Ω and Ωx(tx)≈0 otherwise, and the cardinality of Ωx is |Ωx| = ΣΩ(Ωx(tx)≈1). In other words, each document in Ωx that is indexed by the search engine contains at least one occurrence of the search term. The degree of uncertainty of d=>tx with respect to d=>q is then |Ωx|/|Ω|.
However, if the search term is given as a pattern (in quotes), like tx = "Mahyuddin Khairuddin Matyuso Nasution", then a different result appears. In other words, Ωxp("tx")=1 if tx is true at d in Ω exactly and Ωxp("tx")=0 otherwise, and the cardinality of Ωxp is |Ωxp| = ΣΩ(Ωxp("tx")=1). In this case, each document in Ωxp that is indexed by the search engine contains at least one exact occurrence of the search term. The degree of uncertainty of d=>"tx" with respect to d=>q is |Ωxp|/|Ω|. Thus |Ωxp|/|Ω| ≤ |Ωx|/|Ω|, so |Ωxp| ≤ |Ωx|, or Ωxp is a subset of Ωx.
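The relation between Ωx and Ωxp can be illustrated on a toy collection. This is a minimal sketch, not the matching rule of any real search engine: the three documents, the helper names `singleton` and `phrase_singleton`, and the choice to treat an unquoted term as "all words occur somewhere in the document" are assumptions made for illustration.

```python
def singleton(docs, term):
    """Indices of documents containing every word of the term, in any order.

    Assumed stand-in for the unquoted singleton (Omega_x).
    """
    words = term.lower().split()
    return {
        i for i, doc in enumerate(docs)
        if all(w in doc.lower().split() for w in words)
    }


def phrase_singleton(docs, term):
    """Indices of documents containing the term as an exact contiguous phrase.

    Assumed stand-in for the quoted singleton (Omega_xp).
    """
    words = term.lower().split()
    n = len(words)
    hits = set()
    for i, doc in enumerate(docs):
        tokens = doc.lower().split()
        # Slide a window of n tokens over the document.
        if any(tokens[k:k + n] == words for k in range(len(tokens) - n + 1)):
            hits.add(i)
    return hits


# Hypothetical toy information space (not data from the paper).
docs = [
    "mahyuddin khairuddin matyuso nasution wrote on search engines",
    "nasution and khairuddin met mahyuddin and matyuso",
    "a page about web search",
]
term = "Mahyuddin Khairuddin Matyuso Nasution"

omega_x = singleton(docs, term)
omega_xp = phrase_singleton(docs, term)

# Relative sizes |Omega_x|/|Omega| and |Omega_xp|/|Omega|.
p_x = len(omega_x) / len(docs)
p_xp = len(omega_xp) / len(docs)
```

In this toy collection the quoted term matches only the first document, while the unquoted term also matches the second, so the quoted result set is contained in the unquoted one.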
Let tx and ty be two different search terms. If tx = ty, tx ≠ ty, or |tx|<|ty|, then Ωxp is a subset of Ωx, or Ωyp is a subset of Ωy, or Ωxp is a subset of Ωy, or Ωyp is a subset of Ωy.
Let tx and ty be search terms. Referring to the definitions above, we reveal some characteristics related to the search engine as a system. All characteristics are derived by adapting formulas that model the possible results of the search engine. Some of the adaptive characteristics are as follows [12,13,14].
Similarly, Equations (7) and (8).
The purpose of simulation, in this case, is to construct an approach for selecting the documents in the information space or for disclosing the information in the repository [15]. The experiment collects data by selecting n objects from the community.
From a sample that can represent the population, we develop a table of information as the experiment design for providing data (Table 1).
Table 1. Data that reveal characteristics of a search engine.
In the table, the first column contains the actors' names in alphabetical order. The second column contains the academic level, which serves as the medium of randomness test (mrt) to test whether the sample is random. The third column contains data on scientific publications indexed by Scopus, whereby the actors fall into two categories, author or not; these data serve as the comparative mrt and are intended to support the randomness test of the sample. The next columns contain the list of singletons for tx and for tx in quotes, and a list of doubletons of tx and ty (singleton with keyword). In this case, we ensure that the singletons are also random.
In general, the information space consisting of documents is viewed as the population. Statistically, the population is random, and we test whether this characteristic is also passed down to the sample, so that any measurement on the sample describes the population. We separate the sample A into two categories: n1 is the number of elements ai1 of A that meet the first category and n2 is the number of elements ai2 of A that meet the second category. The run (r) is how many times the category changes in the sample. Thus, the average of the runs is μr = (2n1n2/(n1+n2))+1 (3), and the variance of the runs is given by Eq. (4). Then, we have Zcount as follows: Zcount = (r-μr)/σr (5), with the hypotheses H0: the data sequence is random, and H1: the data sequence is not random.
For academic level as category, professor (pr) or lecturer (lc), we have n1 = 34 and n2 = 17. By using Equations (3), (4), and (5), we obtain μr = 23.67, σr = 0.93 and Zcount = -1.79, and for α = 0.05 we obtain -Z0.025 = -1.96 ≤ Zcount ≤ Z0.025 = 1.96; because Zcount lies between the critical values, H0 is accepted. For the publication of scientific papers indexed by Scopus, author (a) or not (n), similar conditions hold, so that the sequence of data is random.
Furthermore, to test the randomness more thoroughly, we test the independence of two data spaces by using chi-square (χ²). Suppose the data space (ds) is presented in matrix form with entries xij, i=1,…,n, j=1,…,m. The amount of data xij is Sij = Σi=1,…,n,j=1,…,m xij, and the amount of data eij is Eij. Then, we have χ² = Σi=1,…,n,j=1,…,m (xij-eij)²/eij (9) with (m-1)(n-1) degrees of freedom (df).
For example, among 51 actor names we have x11 = 34 professors, x21 = 17 lecturers, x12 = 17 authors, and x22 = 34 non-authors. Based on Eq. (7) we can calculate their expectations, i.e. e11 = e12 = e21 = e22 = 25.5, and based on Eq. (9) we obtain χ² = 11.33 for the test statistic T, which follows a chi-squared distribution with (m-1)(n-1) = (2-1)(2-1) = 1 degree of freedom. The critical value for T at a significance level of 5% is 3.841, so the null hypothesis of independence is rejected because χ² > 3.841. This tells us there is a relationship between academic level and authorship.
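The two tests described above can be sketched in a few lines of Python. This is a hedged illustration: the function names are hypothetical, the run variance uses the textbook Wald-Wolfowitz form (an assumption, since Eq. (4) is not shown here, so it need not reproduce the σr reported in the text), a run is counted as a maximal block of equal labels, and the expected counts use the usual (row total × column total)/grand total rule, which is likewise an assumption about Eq. (7).

```python
import math


def count_runs(labels):
    """Number of runs: maximal blocks of identical consecutive labels."""
    if not labels:
        return 0
    return 1 + sum(1 for a, b in zip(labels, labels[1:]) if a != b)


def runs_test(n1, n2, r):
    """Return (mu_r, sigma_r, z) for r runs with category sizes n1 and n2.

    mu_r follows Eq. (3); sigma_r is the textbook Wald-Wolfowitz value
    (assumed, not taken from the source); z follows Eq. (5).
    """
    n = n1 + n2
    mu = 2 * n1 * n2 / n + 1
    var = (2 * n1 * n2 * (2 * n1 * n2 - n)) / (n ** 2 * (n - 1))
    sigma = math.sqrt(var)
    return mu, sigma, (r - mu) / sigma


def chi_square(table):
    """Return (chi2, df) for a table of observed counts x_ij, as in Eq. (9)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, x in enumerate(row):
            # Assumed expectation rule: row total * column total / grand total.
            e = row_totals[i] * col_totals[j] / total
            chi2 += (x - e) ** 2 / e
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


# Category sizes from the text: 34 professors, 17 lecturers.
mu_r, _, _ = runs_test(34, 17, r=count_runs(list("aabba")))  # r is a dummy here
# Counts from the text: x11 = 34, x12 = 17, x21 = 17, x22 = 34.
chi2, df = chi_square([[34, 17], [17, 34]])
```

With the counts quoted in the text, `mu_r` rounds to 23.67 and `chi2` rounds to 11.33 with one degree of freedom. The critical values 1.96 and 3.841 are taken from standard tables rather than computed.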
To reveal the characteristics of search engines based on the model, we conduct an experiment on the singletons and doubletons of the Google search engine as the test simulation and of the Yahoo search engine as the comparative simulation, as follows.
By Eq. (9), χ² = 18.98 is greater than 12.59 for df = 6 and α = 0.05, so H0 is rejected. Therefore, all the data as a whole are dependent.
In general, a collection of documents in an information space that is indexed by a system is random, see the randomness tests (1a, 1c, 1d, 1e, and 1f), and the information space Ω has a normal distribution, where Eq. (1) is the uniform mass probability function. A row of data in A is random at a confidence level of 95%.
Although the same characteristics can be derived from set theory, singletons from different search engines are not interdependent. The information is therefore presented independently by each engine, because each search engine has its own potential and capabilities. Google and Yahoo differ in this respect: in the Google search engine, the singletons and doubletons are dependent, whereas in the Yahoo search engine they are independent. Therefore, an information space as a system has information that is tied together, but within it different, mutually bound subsystems can be built, for example the Google and Yahoo search engines.