Mathematical model for optimizing distributed information systems

This paper discusses an approach to optimizing distributed information systems and explains why such systems are in demand today. The topic is important because an exact optimal solution cannot be found in practice: the underlying problem is NP-complete and suffers from the curse of dimensionality. The problem of distributing data among the nodes of a distributed information system is defined, and a mathematical model for optimizing distributed information systems is developed to solve it. In the first stage, the criteria for optimizing resource distribution in distributed information systems are analyzed. The following terms are defined: information entity, attribute of the entity, and distributed database. A distributed information system is described from the perspective of distributing entities among nodes. Describing the elements of the mathematical model yields a function of information flows, which accounts for data flows during distributed query execution. The problem of distributing entities among nodes is then stated subject to restrictions such as the prohibition of hosting one entity on different nodes. It is established that the resource distribution problem belongs to the integer programming problems of optimizing distributed database structures, in which the objective function depends indirectly on the distribution variable and restrictions are present.


Introduction
Many enterprises now rely on distributed information systems (DIS) for their business. Compared to local systems, a DIS raises productivity while lowering costs. As a result, such systems grow in size, and so does the amount of data they store and transmit. Developers are therefore challenged to improve the productivity of a DIS even further. This can be achieved by optimizing the distribution of resources among nodes [1,2,3].
The topic is especially important because an exact optimal solution cannot be found in practice for systems of realistic size: the problem is NP-complete and suffers from the curse of dimensionality [4].
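To give a sense of the scale involved, the short sketch below counts candidate placements under a simplifying assumption that is ours, not a formula from the model: each of K entities is hosted on exactly one of L nodes, and capacity limits are ignored. The function name `count_placements` is hypothetical.

```python
def count_placements(num_entities: int, num_nodes: int) -> int:
    """Number of ways to host every entity on exactly one node.

    Assumption: no duplication of entities and no capacity limits,
    so each entity independently picks one of the nodes.
    """
    return num_nodes ** num_entities


if __name__ == "__main__":
    # The count grows exponentially with the number of entities.
    for k in (5, 10, 20, 40):
        print(k, count_placements(k, num_nodes=10))
```

Even with only 10 nodes, 40 entities already give 10^40 candidate placements, which is why exhaustive search is not an option for a DIS of practical size.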

Mathematical model
The problem of data distribution and hosting will be referred to as the resource distribution and hosting problem on DIS nodes.
Choosing an optimization criterion that assesses how effective a DIS is marks the first step toward solving the problem. An Information Entity (s) is "a specific or abstract object and connections between the objects". An Attribute of the Information Entity (A) is "an observed property of the entity or an observed property of a connection between entities of the problem domain".

The given DIS is designed to solve a set of problems over a set of entities S. Each entity carries an index r1, …, rW, where r1 = 1, …, R1; …; rW = 1, …, RW and w = 1, …, W is the level number; the index determines the location of the entity within the dependency hierarchy. Independent entities of level w = 1 thus have a single-component index (e.g., s1, s2, …, sR1), whereas the subentities of level w = 2 have a two-component index (e.g., s1,1, s1,2, …, s1,R2). © is a combining operation that yields entities of the previous level. The concept of creating indices of entities is shown in Figure 1. The description of entities is given in the form of a reference database directory. The following part describes the DIS from the perspective of distributing entities among network nodes.
The structural description of entities is provided in Table 1.
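The hierarchical indexing of entities can be illustrated with a minimal sketch. It assumes that an index is stored as a tuple, so that s1 becomes (1,) and s1,2 becomes (1, 2); the helper names `level`, `parent` and `group_by_parent` are hypothetical and are not part of the model itself.

```python
from collections import defaultdict


def level(index: tuple) -> int:
    """Hierarchy level w of an entity: the number of index components."""
    return len(index)


def parent(index: tuple):
    """Index of the entity one level up, or None for level-1 entities."""
    return index[:-1] if len(index) > 1 else None


def group_by_parent(indices):
    """Map each parent index to the list of its direct subentities."""
    children = defaultdict(list)
    for idx in indices:
        p = parent(idx)
        if p is not None:
            children[p].append(idx)
    return dict(children)


# s1, s2 at level 1; s1,1, s1,2 and s2,1 at level 2.
entities = [(1,), (2,), (1, 1), (1, 2), (2, 1)]
subentities = group_by_parent(entities)
```

Grouping subentities by parent mirrors the role of the combining operation ©, which relates entities of one level to entities of the previous level.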
Let us suppose that a strictly defined set of problems Fl is solved on every l-th node. Let us introduce a new term, Frequency Response of a Problem (FRP), which is the number of times each problem is solved on the l-th node per unit of time.
The FRP is indexed by l, the number of the node where the problem is being solved, and n, the problem number.
The following set of entities is applied to solve the n-th problem:

$$D_{l,n} = \{ d_{l,n,k} \}, \qquad D_{l,n} = D^{*}_{l,n} \cup \tilde{D}_{l,n},$$

where Dl,n is the set of entities used for solving the n-th problem on the l-th node. The set Dl,n is an injection of the set S into the set Fl. The set Dl,n consists of Dl,n* and Dl,n~: the former is the set of nonrelocatable entities used for solving the n-th problem on the l-th node, while the latter is the set of relocatable entities used for solving the n-th problem on the l-th node.
Let us introduce a term, Frequency Response of the Entity (FRE), which is the number of queries to the entity dl,n,k in the process of solving the n-th problem on the l-th node per unit of time.
The FRE is indexed by l, the number of the node where the problem is being solved; n, the number of the problem in whose solution the entity is used; and k, the entity number. Ф is the injection of the set S into the set V. Ф is the set of entities hosted on the nodes and includes Фl, the set of entities hosted on the l-th node. Фl in turn consists of Фl*, a set of nonrelocatable entities, and Фl~, a set of relocatable entities.
The cost of delivering an information entity from an origin node to a destination node, where the problem will be solved, is determined along Rlи,l, the lowest-cost path connecting the source and host nodes, from the arcs belonging to that path.
Table 2 describes the elements of the mathematical model for data distribution in an active DIS.

Table 2. Description of mathematical model elements.

Elements | Mathematical notation
Network | G
Set of attributes describing the k-th entity in the process of solving the n-th problem on the l-th node | Hl,n,k = {hl,n,k,i}

The objective function (11) includes the cost of delivering data from an origin node lи to a destination node l (l ≠ lи).
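The delivery cost along the lowest-cost path can be illustrated with a standard shortest-path search. The sketch below uses Dijkstra's algorithm and rests on two assumptions of ours: the cost of a path is the sum of the costs of its arcs, and all arc costs are non-negative. The function name `lowest_cost_path` and the sample network are hypothetical.

```python
import heapq


def lowest_cost_path(arcs, source, target):
    """Return (cost, path) of the cheapest route from source to target.

    arcs maps a node to a list of (neighbor, arc_cost) pairs.
    Assumption: arc costs are non-negative and add up along a path.
    Returns (inf, []) when the target cannot be reached.
    """
    best = {source: 0.0}
    previous = {}
    queue = [(0.0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            break
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry, a cheaper route was found already
        for neighbor, arc_cost in arcs.get(node, []):
            new_cost = cost + arc_cost
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                previous[neighbor] = node
                heapq.heappush(queue, (new_cost, neighbor))
    if target not in best:
        return float("inf"), []
    # Walk back from the target to rebuild the sequence of nodes.
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    path.reverse()
    return best[target], path


# Hypothetical three-node network: the direct arc 1 -> 2 is more expensive
# than the route through node 3.
network = {1: [(2, 4.0), (3, 1.0)], 3: [(2, 2.0)], 2: []}
cost, path = lowest_cost_path(network, source=1, target=2)
```

In this example the search returns the route 1, 3, 2 with a total cost of 3.0, since it is cheaper than the direct arc of cost 4.0.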
The following restrictions must be imposed when solving the problem of distributing entities among DIS nodes:
• one information entity must not be located on different nodes (12);
• the data placed on a node is limited by the hardware capabilities of that node (13).
In these restrictions, the parameter of the l-th node is the size intended for data distribution and reserved by nonrelocatable entities, and the parameter of the entity dl,n,k is the number of instances of the entity [5].
The first restriction follows from the fact that optimal data distribution does not imply any duplication.
At the same time, data storage security requires additional measures. The network should therefore include redundant nodes that store current replicas of the DIS and thus ensure data security.
The second restriction is related to the hardware capabilities of a node. It makes it possible to control the amount of input information on the l-th node. For nodes that cannot share their resources with others, this value is equal to 0.
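The two restrictions can be checked for a candidate placement with a few lines of code. This is a minimal sketch under assumptions of ours: the second restriction is modelled as a simple capacity inequality (total size of hosted entities not exceeding the size available on the node), and the names `check_placement`, `entity_size` and `free_size` are hypothetical.

```python
def check_placement(placement, entity_size, free_size):
    """Return a list of violated restrictions (empty if the placement is valid).

    placement   maps an entity to the list of nodes that host it.
    entity_size maps an entity to its size.
    free_size   maps a node to the size it can offer for relocatable
                entities (0 for nodes that do not share resources).
    """
    violations = []
    used = {node: 0 for node in free_size}
    for entity, nodes in placement.items():
        # First restriction: an entity is hosted on exactly one node.
        if len(nodes) != 1:
            violations.append(f"entity {entity} is hosted on {len(nodes)} nodes")
        for node in nodes:
            used[node] = used.get(node, 0) + entity_size[entity]
    for node, total in used.items():
        # Second restriction (assumed form): node capacity is not exceeded.
        limit = free_size.get(node, 0)
        if total > limit:
            violations.append(f"node {node} holds {total}, limit is {limit}")
    return violations


valid = check_placement({"a": [1], "b": [2]}, {"a": 5, "b": 3}, {1: 10, 2: 3})
invalid = check_placement({"a": [1, 2]}, {"a": 5}, {1: 10, 2: 0})
```

In the second call the entity is duplicated and one copy lands on a node with zero available size, so both restrictions are reported as violated.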
Considering the above, the problem of entity distribution among nodes can be stated as follows: find an entity distribution Ф among DIS nodes that minimizes the objective function (11), i.e., the query processing time in the process of solving problems on a given interval, subject to restrictions (12) and (13):

$$Z(\Phi) \to \min_{\Phi} \qquad (14)$$
The resource distribution problem belongs to the integer programming problems of optimizing distributed database structures, in which the objective function depends indirectly on the distribution variable and restrictions are present.
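For a very small instance, the minimization can be carried out by exhaustive enumeration, which also makes the integer nature of the problem visible. The sketch below is illustrative only. The objective used here, the query rate multiplied by the delivery cost and summed over all node and entity pairs, is a stand-in chosen by us and is not the objective function (11); the capacity check is likewise an assumed form of restriction (13). All names are hypothetical.

```python
from itertools import product


def best_placement(entities, nodes, query_rate, delivery_cost,
                   entity_size, free_size):
    """Enumerate all single-host placements and return (value, placement).

    query_rate[(node, entity)]  queries per unit of time from node to entity.
    delivery_cost[(host, node)] cost of delivering data from host to node
                                (0 when host and node coincide).
    The stand-in objective is the sum of rate * delivery cost.
    """
    best_value, best_assignment = float("inf"), None
    # Every entity is assigned to exactly one node (no duplication).
    for hosts in product(nodes, repeat=len(entities)):
        assignment = dict(zip(entities, hosts))
        used = {node: 0 for node in nodes}
        for entity, host in assignment.items():
            used[host] += entity_size[entity]
        if any(used[node] > free_size[node] for node in nodes):
            continue  # capacity restriction violated, skip this placement
        value = sum(rate * delivery_cost[(assignment[entity], node)]
                    for (node, entity), rate in query_rate.items())
        if value < best_value:
            best_value, best_assignment = value, assignment
    return best_value, best_assignment


nodes = [1, 2]
entities = ["a", "b"]
query_rate = {(1, "a"): 10, (2, "a"): 1, (2, "b"): 5}
delivery_cost = {(1, 1): 0, (2, 2): 0, (1, 2): 2, (2, 1): 2}
entity_size = {"a": 1, "b": 1}

value, placement = best_placement(entities, nodes, query_rate, delivery_cost,
                                  entity_size, {1: 2, 2: 2})
```

Here each entity ends up on the node that queries it most often. The enumeration visits every combination of hosts, so its running time grows exponentially with the number of entities; it is meant only to illustrate the problem statement, not to serve as a practical solution method.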