Keyword Search over Data Service Integration for Accurate Results

Virtual Data Integration provides a coherent interface for querying heterogeneous data sources (e.g., web services, proprietary systems) with minimal upfront effort. Still, it requires its users to learn a new query language and to become acquainted with the data organization, which may pose problems even to proficient users. We present a keyword search system that proposes a ranked list of structured queries along with their explanations. It operates mainly on metadata, such as the constraints on inputs accepted by the services. It was developed as an integral part of the CMS data discovery service and is currently available as open source.


Introduction
Virtual Data Integration (VDI) is a lightweight approach to integrating heterogeneous data sources where data physically stays at its origin and is requested only on demand [1]. It works as follows: (i) queries are interpreted and sent to the relevant services; (ii) the corresponding responses are consolidated, removing inconsistencies in data formats and entity naming; and (iii) the responses are finally combined. However, this approach forces the users to learn the query language and to become familiar with the data organization, which is often not straightforward, especially without direct access to the data at the services.
In this work, we present a keyword search system which simplifies the interaction with VDI by proposing a ranked list of structured queries. The system operates "off-line" using metadata, such as constraints on inputs accepted by services. It was developed at CMS, the Compact Muon Solenoid experiment at CERN, where it forms part of an open source VDI tool presented next.

Data Aggregation System: a tool for virtual data integration
At the CMS experiment at CERN, the Data Aggregation System (DAS) [2][3][4] integrates several services, the largest of which stores 700 GB of relational data. DAS has no predefined schema, so only minimal service mappings are needed to describe the differences among the services. It uses simple structured queries which consist of an entity to be retrieved and some selection criteria. Optionally, the results can be further filtered, sorted, or aggregated. As can be seen in Figure 1, DAS queries closely match the physical execution flow, requiring users to be aware of it (a design motivated by the large amounts of data the services manage). The proposed keyword search approach relaxes this need to know the internals.

Problem definition and solution overview
Given a keyword query, kwq = (kw_1, kw_2, ..., kw_n), we are interested in translating it into a ranked list of the best-matching structured queries composed of:
• schema terms: entities and their attributes (inputs to the services or their output fields);
• value terms: for some fields a list of values exists, but for most only the constraints on data-service inputs are available, e.g., regular expressions defining the accepted values.
For example, in Figure 3, 'dataset' can be both an entity and an input name, 'dataset.nevents' is an output field, whereas 'RelVal' can be an input value of the dataset, primary dataset, and group fields.
The keyword search works as follows. In the first step, the keyword query is cleaned up and tokenized, identifying any quoted tokens or other structural patterns.
In the second step, each token is assigned a list of its interpretations together with a rough estimate of each interpretation's likelihood (each such pair is called an entry point). Using a mixture of entity matching techniques, each token is considered individually and interpreted as either a schema or a value term.
In the third step, the permutations of entry points, each representing a mapping of keywords to their interpreted meanings, are enumerated and ranked by combining the entry point scores.
Example. Consider the following keyword query: RelVal 'number of events'>100.
Tokenization results in: 'RelVal'; 'number of events>100'. In the second step, each token is assigned its entry points. Finally, execution of the third step yields a ranked list of query suggestions as shown in Figure 3. For more details on the internals of the processing steps see Section 4.
To aid users in typing the queries, live context-dependent suggestions are shown and the different parts of a query are colored by their semantic role (see Figure 4). The autocompletion is based on CodeMirror [5], a versatile JavaScript-based text editor.
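The three-step processing of the example can be sketched as follows. This is an illustrative simplification, not the DAS implementation: the entry points, tags, and scores below are hypothetical, and combinations are ranked by a sum of log-likelihoods (one of the ranking functions discussed in Section 4.3).

```python
import math
from itertools import product

def suggest_queries(entry_points, top_n=3):
    """Rank every combination of per-token interpretations (step 3)
    by the sum of log-likelihoods; an entry point is a (tag, score) pair."""
    suggestions = []
    for combo in product(*entry_points):
        score = sum(math.log(s) for _, s in combo)
        suggestions.append((score, [tag for tag, _ in combo]))
    suggestions.sort(key=lambda x: x[0], reverse=True)
    return suggestions[:top_n]

# Hypothetical entry points for the tokens of "RelVal 'number of events'>100":
entry_points = [
    [("dataset=*RelVal*", 0.7), ("group=RelVal", 0.3)],
    [("dataset.nevents>100", 0.9), ("file.nevents>100", 0.4)],
]
best = suggest_queries(entry_points)
```

Here the top-ranked suggestion maps 'RelVal' to a wildcard dataset pattern and the operator expression to a filter on dataset.nevents, mirroring Figure 3.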

Tokenization (step 1)
At first, the query is standardized (e.g., extra spaces are removed and date formats normalized). Next, the query is tokenized, recognizing quoted phrases and operator expressions (e.g., nevent > 1, 'number of events'=100, 'number of events>=100'). Stop words (e.g., a, which, when) are identified and given less importance in the later processing steps.
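A minimal regex-based tokenizer in this spirit could look as follows (an illustrative sketch, not the DAS code; the pattern names are ours):

```python
import re

# Recognize, in order of priority: quoted field + operator + value,
# a bare quoted phrase, a bare operator expression, or a plain word.
TOKEN = re.compile(
    r"""'[^']*'\s*(?:[<>=!]=?)\s*\S+   # quoted phrase with operator
      | '[^']*'                        # quoted phrase
      | \S+\s*(?:[<>=!]=?)\s*\S+       # bare operator expression
      | \S+                            # plain word
    """,
    re.VERBOSE)

def tokenize(query):
    query = re.sub(r"\s+", " ", query.strip())  # standardize whitespace
    return [m.group(0) for m in TOKEN.finditer(query)]

tokens = tokenize("RelVal 'number of events' > 100")
```

On the running example this yields two tokens: the plain word and the quoted operator expression.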

Entry point generation (step 2)
This step matches each token to a schema or a value term. For each token, a list of its interpretations is obtained, including a score providing a rough estimate of the interpretation's likelihood when considered individually, ignoring the influence of nearby tokens. For matching value terms, a listing of known values is used when available (several cases are distinguished: a full match, a partial match, and a match containing wildcards). Otherwise, the regular expressions constraining the inputs accepted by services are used. Unlikely interpretations are penalized; e.g., entity names and numbers are questionable wildcard value matches (the values of dataset are string-typed, but contain numbers and the entity name 'block').
The schema terms are matched by checking each keyword for: full, lemma, and stem matches, and a stem match within a small string edit-distance (in the order of decreasing scores).
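The matching cascade can be sketched as below. This is a toy illustration under our own assumptions: the suffix-stripping stemmer is a crude stand-in (a real lemmatizer or stemmer would come from an NLP library, and the lemma-match step is omitted), the scores are arbitrary, and difflib's ratio stands in for a string edit-distance check.

```python
from difflib import SequenceMatcher

def toy_stem(word):
    """Crude suffix-stripping stemmer, for illustration only."""
    for suf in ("ing", "ed", "es", "s"):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def match_schema_term(keyword, schema_terms):
    """Try a full match, then a stem match, then a stem match within a
    small edit distance, with decreasing (hypothetical) scores."""
    kw = keyword.lower()
    for term in schema_terms:
        if kw == term:
            return term, 1.0                      # full match
    for term in schema_terms:
        if toy_stem(kw) == toy_stem(term):
            return term, 0.8                      # stem match
    best, ratio = max(
        ((t, SequenceMatcher(None, toy_stem(kw), toy_stem(t)).ratio())
         for t in schema_terms),
        key=lambda x: x[1])
    return (best, 0.5 * ratio) if ratio > 0.7 else (None, 0.0)

match = match_schema_term("datasets", ["dataset", "block", "file"])
```

For the keyword "datasets", the stem match fires and returns the 'dataset' entity with a reduced score.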
Finally, matching fields in service outputs involves identifying multiword keyword chunks corresponding to the field names (e.g., number of events in a file → file.nevents). Many of these names are directly extracted from service responses in JSON or XML formats, which were not intended for human consumption: the names contain irrelevant or common terms. Thus, inspired by [6], we employ whoosh [7], an Information Retrieval (IR) library, where for each field in service outputs we create a "multifielded document" which contains the field name, its parent, its base name, and its title if one exists. To find the matches, the IR engine is queried for each chunk of up to k nearby keywords, with k ≤ 4. The IR ranker uses the BM25F scoring function [6], where "document fields" are assigned different weights and phrase matches are scored higher. Finally, the IR score is used directly as the score of the generated entry point match.
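The idea of multifielded documents with per-field weights can be illustrated without the IR library. The sketch below is not the whoosh/BM25F implementation: it replaces BM25F with a plain weighted term-overlap score, and the weights (base name 3.0, title 2.0, parent 1.0) are our own hypothetical choices.

```python
def field_documents(fields):
    """Build weighted 'multifielded documents' for service-output fields.
    `fields` is a list of (full_name, title) pairs."""
    docs = []
    for full_name, title in fields:
        parent, _, base = full_name.rpartition(".")
        docs.append({
            "field": full_name,
            "terms": {  # term -> weight (illustrative values)
                **{t: 3.0 for t in base.replace("_", " ").split()},
                **{t: 2.0 for t in title.lower().split()},
                **{t: 1.0 for t in parent.split(".") if t},
            },
        })
    return docs

def score_chunk(chunk_words, doc):
    """Weighted term overlap, a crude stand-in for the BM25F score."""
    return sum(doc["terms"].get(w, 0.0) for w in chunk_words)

docs = field_documents([("file.nevents", "number of events"),
                        ("file.size", "file size")])
best = max(docs, key=lambda d: score_chunk(["number", "of", "events"], d))
```

The chunk "number of events" matches the title terms of file.nevents, so that field wins even though its base name 'nevents' shares no literal word with the query.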

Ranking the query suggestions (step 3)
In this step, the various permutations of the entry points, forming mappings between keywords kw_i and their interpretations tag_i, are evaluated and ranked. The score of a mapping is obtained by combining the entry point scores, score(tag_i | kw_i), and the scores returned by contextualization rules, h_j ∈ H. The latter aim to account for keyword interdependencies by promoting permutations where (i) nearby keywords refer to related schema terms, e.g., an entity name and its value, or (ii) frequent use-cases are matched, e.g., retrieving an entity by its "primary key".
We experimented with two ranking functions: the average of scores (1), as used in Keymantic [8,9] (see Section 5), and the sum of log-likelihoods (2). At first the probabilistic approach (2) seemed more sensitive to inaccuracies in entry point scoring, but after improvements to the accuracy of entry point generation it became clearly better than the averaging approach (1).
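The behavioral difference between the two ranking functions can be seen on a small example (the scores below are hypothetical): a single very unlikely interpretation barely moves the average, but it sinks the sum of log-likelihoods, which is why function (2) rewards accurate entry point scores.

```python
import math

def rank_avg(scores):
    """Ranking function (1): average of the entry point scores."""
    return sum(scores) / len(scores)

def rank_loglik(scores):
    """Ranking function (2): sum of log-likelihoods, i.e. the log of
    the joint probability under an independence assumption."""
    return sum(math.log(s) for s in scores)

a = [0.9, 0.9, 0.01]  # two strong interpretations, one very unlikely
b = [0.6, 0.6, 0.6]   # three mediocre interpretations
```

Here rank_avg prefers mapping a (0.603 vs 0.600), while rank_loglik strongly prefers b, penalizing the near-impossible third interpretation in a.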

Implementation details
The keyword search, as well as the DAS data integration system, is implemented in Python.
The ranker simply enumerates all the possible mappings, with early pruning of suggestions that are not supported by the services. More complex alternatives exist, but this is the simplest one that allows early pruning, unlimited contextualization, and listing multiple results. The ranker is implemented in Cython [10] which, by compiling the critical parts into plain C code, allows us to easily obtain sufficient performance even when an exhaustive search is performed.
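The enumeration-with-pruning strategy can be sketched in pure Python (the real ranker is compiled with Cython; the predicate name below is ours):

```python
def enumerate_mappings(entry_points, is_supported):
    """Exhaustively enumerate keyword -> tag mappings, abandoning any
    partial mapping the services cannot answer (early pruning).
    `entry_points` holds the candidate tags per keyword;
    `is_supported` checks a (possibly partial) mapping."""
    def extend(prefix, rest):
        if not is_supported(prefix):
            return                      # prune the whole subtree
        if not rest:
            yield tuple(prefix)         # complete, supported mapping
            return
        for tag in rest[0]:
            yield from extend(prefix + [tag], rest[1:])
    yield from extend([], entry_points)

eps = [["dataset=*X*", "group=X"], ["dataset.nevents>100"]]
ok = lambda prefix: "group=X" not in prefix   # toy support predicate
results = list(enumerate_mappings(eps, ok))
```

Pruning at the prefix level is what keeps the exhaustive search tractable: an unsupported partial mapping cuts off every extension of it.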

Related work
Keymantic [8, 9] answers keyword queries over relational databases with limited access to the data instances. First, individual keyword matches are generated using techniques similar to those presented in Section 4.2, but targeting a less specific domain than ours. Second, to obtain the global ranking, the assignment problem (associating each keyword with a tag/interpretation), extended with weight contextualization 2, is considered (see Section 6). The resulting labels are interpreted as SQL queries and presented to users. We noticed that summing the log-scores instead of the plain scores, as used in Keymantic, gives better ranking quality, especially if these scores are good approximations of the respective likelihoods (see Section 4.3).
KEYRY [11] took a different approach to the same problem, allowing user feedback to be incorporated. It uses a sequence tagger based on a Hidden Markov Model (HMM). The initial HMM parameters can be estimated through heuristic rules (e.g., promoting related tags). To produce the results, the List Viterbi algorithm [12] is used to obtain the top-k most probable keyword taggings, which are later interpreted as SQL queries. Once a sufficient amount of logs or user feedback is collected, the HMM can be improved through supervised or unsupervised training [13]. The accuracy of this method is comparable to that of Keymantic [11].
Finally, Guerrisi et al. [14] focused on answering full-sentence open-domain queries over web services using natural language processing techniques. Instead, we focus on closed-domain queries without restricting the input to full sentences.

Discussion: Contextualized Weighted Assignment Problem
At a larger scale, more efficient ranking algorithms are needed than the one discussed in Section 4.3 (though they would add additional assumptions or complexity). In Section 5, the contextualized weighted assignment problem was introduced as one of the approaches to keyword search over VDI. In this section, stepping away from our current implementation, we discuss the theoretical basis of this approach and the related work. We raise some unresolved issues and propose an efficient algorithm for the special case when the number of contextualizations is very low.

Introduction: Weighted Assignment Problem
Given n keywords and m tags, n ≤ m, and an n×m matrix of scores (called weights), the assignment problem, reformulated for keyword search, asks to find a maximum weighted bipartite matching: maximize the sum of weights such that each keyword is assigned to exactly one tag, and a tag is chosen no more than once. This can be solved efficiently in Θ(n²m) by the Munkres algorithm. In short, it splits the assignment problem into two easier ones (see [15][16][17] for details): (i) maintain a set of constraints that restrict the currently admissible matches to be "cheap enough"; (ii) solve n unweighted bipartite assignments: start with an empty matching and find an augmenting path to increase the size of the matching, flipping the state of the edges along this path (matching new ones or deselecting matched ones); if no augmenting path exists, loosen the constraints on the weights. To efficiently list the k best results one can use Murty's algorithm [18], running in Θ(kn³m): to obtain each additional result, it solves n−1 smaller assignment problems with Munkres. We have seen that solving the basic weighted assignment problem can efficiently generate the assignments; however, this still does not account for the interdependencies between the keyword interpretations.
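For concreteness, the problem statement can be checked on a tiny instance. The sketch below solves it by brute force over tag permutations, which is only feasible for toy sizes; it stands in for the Munkres algorithm, which solves the same problem in Θ(n²m).

```python
from itertools import permutations

def best_assignment(weights):
    """Maximum-weight assignment of n keywords to m tags (n <= m).
    weights[i][j] is the score of assigning keyword i to tag j.
    Brute force for illustration; Munkres does this in O(n^2 m)."""
    n, m = len(weights), len(weights[0])
    best, best_tags = float("-inf"), None
    for tags in permutations(range(m), n):  # each tag used at most once
        w = sum(weights[i][t] for i, t in enumerate(tags))
        if w > best:
            best, best_tags = w, tags
    return best, best_tags

# 2 keywords, 3 tags:
result = best_assignment([[3, 1, 0],
                          [2, 4, 1]])
```

Here the optimum assigns keyword 0 to tag 0 and keyword 1 to tag 1, for a total weight of 7.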

Algorithm 1 top-k assignments with limited contextualization (sketch)
1. Solve the problem without contextualizations once, in Θ(n²m). The result will be reused in the later steps.
2. Enumerate all C contextualization possibilities (in depth-first order, to reuse matrix modifications):
2.1. Use Murty's algorithm [18] to get the top-k results over the contextualized cost matrix; further, the expected runtime of Murty's algorithm can be considerably improved [20].
3. Merge all of the top-k solutions found in step 2.1.
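The structure of Algorithm 1 can be sketched as follows. This is an illustrative simplification under our own assumptions: a brute-force top-k enumeration stands in for Murty's algorithm, contextualizations are modeled as additive weight deltas, and the sub-solution reuse from steps 1 and 2 is omitted.

```python
import heapq
from itertools import permutations

def topk_contextualized(weights, contextualizations, k=3):
    """Enumerate every contextualization (a weight-matrix modification),
    list the top-k assignments for each, then merge the results.
    `contextualizations` maps a label to a dict {(i, j): delta}."""
    n, m = len(weights), len(weights[0])
    merged = []
    for label, deltas in contextualizations.items():
        w = [row[:] for row in weights]          # copy the cost matrix
        for (i, j), d in deltas.items():
            w[i][j] += d                          # apply contextualization
        topk = sorted(
            ((sum(w[i][t] for i, t in enumerate(tags)), label, tags)
             for tags in permutations(range(m), n)),  # Murty stand-in
            reverse=True)[:k]
        merged.extend(topk)
    return heapq.nlargest(k, merged)              # step 3: merge

result = topk_contextualized(
    [[1, 0], [0, 1]],
    {"none": {}, "boost": {(0, 1): 5}})
```

In the toy instance, the boosted weight makes the assignment (keyword 0 → tag 1, keyword 1 → tag 0) the global winner, which the uncontextualized matrix alone would never have ranked first.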

Supporting the Contextualizations and Issues with that
Even though keyword queries lack the clear structure of natural language, it has been noticed that nearby keywords are often related [19]. To account for these likely dependencies, contextualization adjusts the scores depending on the context: e.g., if keyword kw_i is assigned tag_i, then the nearby keywords become more likely to get tags related to tag_i, and their scores are increased accordingly.
To support this in Keymantic [8, 9], some internal steps of the Munkres algorithm were modified. When the size of the matching is increased, the newly matched cells are contextualized, while the unmatched ones are uncontextualized; this triggers weight updates in the dependent cells.
However, the problem with this modification is that unmatching a currently matched cell may lead to weight updates in other currently matched cells, possibly making them inadmissible. Consequently, this may violate some of the assumptions of the algorithm, e.g., that each iteration increases the size of the matching [15, p. 250]. We therefore suggest that further investigation is needed.

Solution for a low number of contextualizations
We are not aware of any method that can efficiently compute the top-k optimal solutions to this problem (Keymantic can efficiently list only the approximately best results). Fortunately, the problem is simpler if the number of all contextualization possibilities is low: one can simply enumerate all of the contextualization possibilities and combine the solutions to the basic assignment problem. It is worth observing that, in this special case, there are large similarities between the contextualized cost matrices, so parts of the sub-solutions can be reused. As this is out of the scope of this work, only a brief idea is provided in Algorithm 1.

Conclusions
We have presented an implementation of keyword search over virtual data-service integration, adapted to the specifics of the CMS experiment. User feedback has shown that in data integration settings, which provide only limited access for exploring the data, interactive autocompletion can successfully help users compose semi-structured queries.
The public availability of corporate, governmental, and other data services is increasing, as is the popularity of repositories and tools for integrating them. Meanwhile, the availability of user-friendly interfaces is becoming an increasingly important issue. Future challenges may include answering queries over much larger numbers of data tables and data services, and answering queries more complex than the ones considered in this work.