Combined string searching algorithm based on Knuth-Morris-Pratt and Boyer-Moore algorithms

String searching is a classic information processing task. Users encounter it directly when they use the built-in search tools of text processors or browsers, and indirectly when it is solved behind the scenes by various computer programmes. Many algorithms exist for solving the string searching problem, and the main criterion of their effectiveness is searching speed: the larger the shift of the pattern relative to the string after a mismatch of pattern and string characters, the faster the algorithm runs. This article offers a combined algorithm developed on the basis of the well-known Knuth-Morris-Pratt and Boyer-Moore string searching algorithms. These algorithms rest on two different principles of pattern matching: the Knuth-Morris-Pratt algorithm is based upon forward pattern matching, and the Boyer-Moore algorithm upon backward pattern matching. By uniting the two, the combined algorithm obtains a larger shift when pattern and string characters mismatch. The article provides an example that illustrates the work of the Boyer-Moore, Knuth-Morris-Pratt and combined algorithms and shows the advantage of the latter in solving the string searching problem.


Introduction
The searching problem is one of the fundamental tasks of theoretical programming [10]. String searching is a simple but nonetheless extremely important problem. Its importance is explained by the wide range of applications of its solutions: text editors, online string matching, speech analysis and recognition, information retrieval, network content inspection, data compression, etc. [4].
Many algorithms for solving this problem have been developed. During the last thirty years, numerous algorithms were offered that solve searching problems with particular characteristics. M. Ahmed, M. Kaykobad and R.A. Chowdhury developed a new string matching algorithm that, unlike other sub-linear string matching algorithms, never performs more than n text character comparisons while working on a text of length n [1]. T. Lecroq proposed string matching algorithms based on hashing q-grams [9]. K. Fredriksson and S. Grabowski developed a new exact bit-parallel string matching algorithm based on the shift-or algorithm [2,5]. They have shown how to adapt their techniques to the shift-add algorithm, obtaining optimal time for searching under Hamming distance [5]. L. He, B. Fang and J. Sui used the concept of a window whose size is equal to the pattern length and presented a novel string matching algorithm named the wide window algorithm [6]. A. Hudaib et al. [7] presented the two sliding windows algorithm, in which both windows slide in parallel over the text until the first occurrence of the pattern is found or until both windows reach the middle of the text.
The continuing interest in this class of problems confirms the importance and topicality of the string searching problem. This article offers an algorithm based upon two well-known and acknowledged string searching algorithms. Combining them increases the effectiveness of solving the given problem.

Theoretical bases of the combined algorithm
The combined string searching algorithm is based upon two algorithms. One of them was developed by Knuth, Morris and Pratt, and the other one by Boyer and Moore. These algorithms belong to two large sub-classes of string searching algorithms: the Knuth-Morris-Pratt algorithm belongs to the forward pattern matching sub-class, and the Boyer-Moore algorithm belongs to the backward pattern matching sub-class. Two stages can be identified in the work of these algorithms:
1. Preparing the table d, which is used when shifting the pattern along the string.
2. String searching itself.
Let us look at the work of these string searching algorithms in detail. The description is accompanied by examples and explanations in the C language.

Principles of Knuth-Morris-Pratt string searching algorithm
The algorithm was developed by D. Knuth and V. Pratt, and independently by J. H. Morris in 1974, but it was published by them collaboratively only in 1977 [8]. The algorithm is based upon the idea that after a partial match of the initial part of the pattern with the corresponding characters of the string, the pattern itself provides information that allows moving forward along the string not by one character, as in naive string matching, but further. To do that, the Knuth-Morris-Pratt string searching algorithm (or KMP algorithm) uses a table d, which is preprocessed before string searching begins. Since the table d preprocessed according to the KMP algorithm will also be used in the combined algorithm, let us denote it as d_KMP.
Thus, the table d_KMP is preprocessed on the basis of the pattern and contains values that are used later to calculate the shift amount of the pattern. The size of the table is equal to the length of the pattern: d_KMP is a one-dimensional array with as many elements as there are characters in the pattern.
The first element of the array d_KMP is always minus one. Let us look at the formation of the table d_KMP using the pattern "barbarian" as an example. As shown in Figure 1, the first element of the table, corresponding to the first character of the pattern, "b", is equal to minus one. The element corresponding to any other character is equal to the maximum number of characters that directly precede the given character and match the beginning of the pattern. If k characters precede the given character, then only k−1 preceding characters are taken into account.
Thus, Figure 2 shows that one character "b" directly precedes the fifth character "a" and matches the character "b" at the beginning of the pattern. Both characters "b" are in bold type. Since the number of matching characters is equal to one, the element of the table corresponding to the fifth character of the pattern, "a", receives the value one (see Figure 2). Two characters "ba" directly precede the sixth character of the pattern, "r", and match the characters "ba" at the beginning of the pattern. Both pairs of characters "ba" are in bold type. Since the character "r" is preceded by two characters matching the first two characters of the pattern, the corresponding element of the table is given the value 2 (see Figure 3). By the same rule, the three characters "bar" directly precede the seventh character "i" and match the first three characters of the pattern, so the corresponding element must be given the value 3; with a zero in this position, a mismatch at "i" in a string such as "barbarbarian" would shift the pattern past the match. No other characters of the pattern "barbarian" are preceded by characters matching the beginning of the pattern; therefore the corresponding elements of the table d_KMP are equal to zero (see Figure 4).
The second stage of the KMP algorithm is comparing the characters of the string with the characters of the pattern and calculating the shift amount in case of a mismatch. The characters are considered from left to right, i.e. from the beginning to the end of the pattern. In case of a mismatch of pattern and string characters, the pattern is shifted to the right along the string by the following value: j - d_KMP[j], in which j is the index of the currently considered character of the pattern and d_KMP[j] is the value of the table d_KMP corresponding to this character. Figure 5 shows an example illustrating the shift of the pattern during the work of the KMP algorithm.
Let us give an example of searching for the pattern "barbarian". Figure 6 shows the process of the KMP algorithm's work. The characters that are being compared are underlined.
[Figure 6: the string "bar is full of barbarians" with the pattern "barbarian" shown at each successive shift.]
Note that in this example, at every mismatch the pattern is shifted by the whole passed length (or by one character when the very first comparison fails), because smaller shifts cannot lead to a full match.
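The table construction described above can be sketched in C as follows. This is a minimal illustration rather than the authors' implementation; the function name build_d_kmp is an assumption made for this example. For "barbarian" the rule "number of directly preceding characters that match the beginning of the pattern" gives -1 0 0 0 1 2 3 0 0: the seventh character "i" is preceded by "bar", which matches the first three characters of the pattern.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: fills d[0..m-1] for the pattern p of length m.
 * d[0] is -1; d[j] is the largest number of characters directly
 * preceding p[j] that match the beginning of the pattern (only
 * proper prefixes are considered). The caller provides room for
 * strlen(p) ints. */
void build_d_kmp(const char *p, int *d)
{
    size_t m = strlen(p);
    if (m == 0)
        return;

    d[0] = -1;
    int k = -1;                      /* length of the prefix matched so far */
    for (size_t j = 0; j + 1 < m; j++) {
        /* fall back to shorter prefixes until p[j] extends one of them */
        while (k >= 0 && p[j] != p[k])
            k = d[k];
        k++;
        d[j + 1] = k;
    }
}
```

Called as build_d_kmp("barbarian", d) with an int d[9], the sketch leaves the shift for a mismatch at index j available as j - d[j].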
As far as the effectiveness of the Knuth-Morris-Pratt string searching algorithm is concerned, its developers show that about n + m character comparisons are needed, which is better than the n * m comparisons of naive string matching (where n is the length of the string, m is the length of the pattern, 0 < m < n) [8,10]. The string scanning index i never goes back, whereas in naive string matching the comparison starts again from the first pattern character after a mismatch, so string characters that have already been considered may be processed again.
However, the Knuth-Morris-Pratt string searching algorithm helps only if the mismatch of string and pattern characters was preceded by some number of matches. If the very first characters compared are different, the pattern shifts by only one character. The pattern shifts by more than one character only if several characters of the string and the pattern match.
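The second stage can be sketched in C as shown below, assuming zero-based indices and the shift j - d_KMP[j] taken from the description above. The name kmp_search and the convention of returning -1 when nothing is found are assumptions of this example, not part of the original article.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch: returns the index of the first occurrence of
 * p in s, or -1 if there is none (also -1 if memory runs out). */
long kmp_search(const char *s, const char *p)
{
    size_t n = strlen(s), m = strlen(p);
    if (m == 0)
        return 0;
    if (m > n)
        return -1;

    int *d = malloc(m * sizeof *d);
    if (d == NULL)
        return -1;

    /* Stage 1: table d_KMP, with d[0] = -1. */
    d[0] = -1;
    int k = -1;
    for (size_t t = 0; t + 1 < m; t++) {
        while (k >= 0 && p[t] != p[k])
            k = d[k];
        d[t + 1] = ++k;
    }

    /* Stage 2: i is the position of the pattern in the string,
     * j is the index of the pattern character being compared. */
    long result = -1;
    size_t i = 0;
    int j = 0;
    while (i + m <= n) {
        while ((size_t)j < m && s[i + (size_t)j] == p[j])
            j++;
        if ((size_t)j == m) {
            result = (long)i;
            break;
        }
        i += (size_t)(j - d[j]);      /* shift by j - d_KMP[j] */
        j = d[j] < 0 ? 0 : d[j];      /* these characters are known to match */
    }

    free(d);
    return result;
}
```

On the string "bar is full of barbarians" the first mismatch occurs at j = 3, so the pattern moves by 3 - 0 = 3; every following mismatch occurs at j = 0 and moves the pattern by one character, which is the weakness described above.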

Principles of Boyer-Moore string searching algorithm
In 1977 R.S. Boyer and J.S. Moore offered an algorithm that not only improves the processing of the worst case but also gives an advantage in the general case [3]. The Boyer-Moore string searching algorithm (or BM algorithm) is based upon an unusual idea: the comparison of characters starts not from the beginning but from the end of the pattern. The speed of the Boyer-Moore algorithm is achieved by skipping parts of the text that certainly cannot take part in a successful comparison.
As in the case of the KMP algorithm, a table d is preprocessed on the basis of the pattern before searching begins and is then used when shifting the pattern along the string. Since this table will also be used in the combined algorithm, let us denote it as d_BM.
Initially, all the elements of the table d_BM are given values equal to the length of the pattern.
At the next stage, every element of the table d_BM that corresponds to a character of the pattern is given a value equal to the remoteness of that character from the end of the pattern. Figure 7 shows the pattern and the remoteness of each of its characters from the end of the pattern. If a pattern contains several identical characters, the element of table d_BM corresponding to this character is given the value equal to the remoteness from the end of the pattern of the rightmost such character [10]. For example, the pattern "barbarian" contains three characters "a", and the remoteness of the rightmost one from the end of the pattern is equal to one. Accordingly, the elements of table d_BM corresponding to the other characters "a" in the pattern are also given the value one (see Figure 8). Similarly, the elements of table d_BM corresponding to all the characters "r" are given the value three, and the elements corresponding to the characters "b" have the value five. When implementing the Boyer-Moore string searching algorithm, as well as the combined algorithm, it is recommended to index the table d_BM by ASCII character codes.
Every printable character has its own ASCII code. For example, the code of the character "a" is 97, the code of "r" is 114, and the code of the space is 32. Codes above 127 belong to extended code pages; in Windows-1251, for instance, the codes of Cyrillic letters are situated between 192 and 255. The value of the element of table d_BM corresponding to the character "a" is, as was already shown in Figure 9, equal to one. Therefore: d[97] = 1. It can also be written in the C language: d['a'] = 1.
In the implementation, the element of array d with the index 97 therefore corresponds to all the characters "a" in the pattern "barbarian". Thus, all the elements of table d_BM corresponding to the characters "a" of the pattern are given the value 1 (see Figure 9).
The values of the elements of table d_BM corresponding to the characters "i", "r", "b" change in a similar way: d['i'] = 2, d['r'] = 3, d['b'] = 5. A character whose remoteness from the end of the pattern is equal to zero is assigned the value one and not zero; this is connected with the peculiarities of calculating the shift amount during the string searching itself, since a shift of zero would leave the pattern in place.
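The formation of the table can be sketched in C as follows. The function name build_d_bm and the constant ALPHABET are assumptions made for this illustration; the table is indexed by the character code, as recommended above, and a remoteness of zero is replaced by one as the text describes.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define ALPHABET (UCHAR_MAX + 1)   /* one table entry per character code */

/* Hypothetical helper: fills d[c] for every character code c.
 * Characters absent from the pattern keep the pattern length;
 * characters of the pattern get the remoteness of their rightmost
 * occurrence from the end of the pattern, with 0 replaced by 1. */
void build_d_bm(const char *p, size_t d[ALPHABET])
{
    size_t m = strlen(p);

    for (size_t c = 0; c < ALPHABET; c++)
        d[c] = m;

    /* Going left to right, later (more right) occurrences overwrite
     * earlier ones, so the rightmost occurrence wins. */
    for (size_t j = 0; j < m; j++) {
        size_t remoteness = m - 1 - j;
        d[(unsigned char)p[j]] = remoteness ? remoteness : 1;
    }
}
```

For the pattern "barbarian" this yields d['a'] = 1, d['i'] = 2, d['r'] = 3, d['b'] = 5, and the value 9 for every character that does not occur in the pattern, such as "u" or the space.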
The second stage of the BM algorithm is string searching itself. While the pattern and the string are being compared, the pattern moves from left to right along the string. However, pattern and string characters are compared from right to left along the pattern. That is the peculiarity of backward pattern matching.
The comparison of the pattern and the string is carried out 1) until the whole pattern is considered, which indicates that there is a match between the pattern and some part of the string, 2) until the string ends, which means that there are no entries matching the pattern in the string, 3) until there is a mismatch of the pattern and string characters, which leads to shifting of pattern by several characters to the right and continuing the searching process.
In case of a mismatch of characters, the shift of the pattern along the string is defined by the value of an element of table d_BM. The index of this element is the ASCII code of a character of the string: although array d is formed on the basis of the pattern, the shift is defined by the mismatching character of the string. Figure 9 shows an example of the BM algorithm's work. The characters that are being compared are underlined.
[Figure 9. Process of work of Boyer-Moore string searching algorithm: the string "bar is full of barbarians" with the pattern "barbarian" shown at each successive shift.]
In the first iteration there is a mismatch of the pattern character "n" and the string character "u". Recall that at the initialization stage the values of all the elements of array d, for all possible characters, are equal to the length of the pattern, i.e. to nine. Since the character "u" is not found in the pattern, the value of the element d['u'] is not changed during the further formation of the table and remains equal to nine. That is why the pattern is shifted by nine characters to the right. If these two characters had matched, the last but one pattern character and the corresponding string character would have been considered next, and so on.
In the next comparison of the pattern and the string there is a mismatch of the characters "n" and "r". Once again defining the shift with the help of the string character (d['r']), we get the value three, and the pattern shifts to the right by three characters. Similarly, the pattern gradually shifts along the string until the pattern is found in the string or until the string ends.
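The walk-through above can be reproduced with the following C sketch. It is a simplified illustration under stated assumptions: the name bm_search is hypothetical, indices are zero-based, and the shift is always looked up for the string character aligned with the last pattern character (the Horspool simplification). In the example above this is the mismatching character in both cases, and this choice keeps the shift safe when the mismatch occurs further to the left.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: returns the index of the first occurrence of
 * p in s, or -1 if there is none. */
long bm_search(const char *s, const char *p)
{
    size_t n = strlen(s), m = strlen(p);
    size_t d[UCHAR_MAX + 1];

    if (m == 0)
        return 0;

    /* Stage 1: table d_BM (pattern length by default, remoteness of
     * the rightmost occurrence otherwise, 0 replaced by 1). */
    for (size_t c = 0; c <= UCHAR_MAX; c++)
        d[c] = m;
    for (size_t j = 0; j < m; j++)
        d[(unsigned char)p[j]] = (m - 1 - j) ? (m - 1 - j) : 1;

    /* Stage 2: the pattern moves left to right, the characters are
     * compared right to left. */
    size_t i = 0;
    while (i + m <= n) {
        size_t j = m;
        while (j > 0 && s[i + j - 1] == p[j - 1])
            j--;
        if (j == 0)
            return (long)i;            /* the whole pattern matched */
        /* assumption: shift by the entry of the string character
         * under the last pattern position */
        i += d[(unsigned char)s[i + m - 1]];
    }
    return -1;
}
```

For the string "bar is full of barbarians" the sketch shifts by 9 (character "u"), then by 3 and 3 (character "r" twice), and finds the pattern at position 15 after three shifts.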
The evaluation of the BM algorithm's effectiveness shows that in almost all cases the algorithm requires far fewer than n comparisons. In the most favorable conditions, when the last pattern character always falls on a mismatching string character, the number of comparisons is n / m [3, 10].

Combination of Knuth-Morris-Pratt and Boyer-Moore string searching algorithms
Based upon the principles of the Knuth-Morris-Pratt and Boyer-Moore string searching algorithms described above, concerning both the creation of the shift tables and the string searching itself, the following combined algorithm is offered.
Step 1. Creating table d_KMP according to the principles of the Knuth-Morris-Pratt string searching algorithm.
Step 2. Creating table d_BM according to the principles of the Boyer-Moore string searching algorithm.
Step 3. Defining the initial value of index i, corresponding to the position of the pattern relative to the string.
Step 4. Defining the initial values of indices j_KMP and j_BM, which indicate the beginning and the end of the pattern respectively.
Step 5. Comparing the pattern character with index j_KMP and the corresponding string character, and comparing the pattern character with index j_BM and the corresponding string character. If at least one comparison ends with a mismatch, go to Step 11.
Step 6. If j_KMP is less than j_BM, go to Step 9.
Step 7. Output of the message informing that the pattern matches a part of the string.
Step 8. Go to Step 15.
Step 9. Increasing index j_KMP by one (i.e. going to the next character to the right), decreasing index j_BM by one (i.e. going to the next character to the left).
Step 10. Go to Step 5.
Step 11. Choosing the larger shift from j_KMP - d_KMP[j_KMP] and d_BM[i + pattern length - j_BM].
Step 12. Shifting the pattern to the right relative to the string, increasing the value of index i by the shift defined in Step 11.
Step 13. If the sum of index i and the length of the pattern is less than the length of the string, go to Step 4.
Step 14. Output of the message informing that the target pattern is not found.
Step 15. Stop.
The combined algorithm finds matches between the pattern and a part of the string with a smaller number of shifts, because in Step 11 the larger of the two shifts obtainable with the Knuth-Morris-Pratt and Boyer-Moore string searching algorithms is chosen. At the same time, these algorithms themselves guarantee that, however large the shift is, no match of the pattern and the string will be overlooked.
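The steps of the combined algorithm can be sketched in C as follows. This is an illustrative reading of the steps, not the InfoSearch implementation, and it rests on several assumptions: the name combined_search is hypothetical; indices are zero-based, so the outer loop continues while i plus the pattern length does not exceed the string length; the value -1 stands for "not found"; d_KMP holds, for every position, the number of directly preceding characters that match the beginning of the pattern; and the Boyer-Moore shift is looked up for the string character under the last pattern position, which keeps that shift safe wherever the mismatch occurs.

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of the combined search: returns the index of
 * the first occurrence of p in s, or -1 (also on allocation failure). */
long combined_search(const char *s, const char *p)
{
    long n = (long)strlen(s), m = (long)strlen(p);
    long d_bm[UCHAR_MAX + 1];
    long result = -1;

    if (m == 0)
        return 0;

    /* Step 1: table d_KMP. */
    long *d_kmp = malloc((size_t)m * sizeof *d_kmp);
    if (d_kmp == NULL)
        return -1;
    d_kmp[0] = -1;
    for (long t = 0, k = -1; t + 1 < m; t++) {
        while (k >= 0 && p[t] != p[k])
            k = d_kmp[k];
        d_kmp[t + 1] = ++k;
    }

    /* Step 2: table d_BM (remoteness 0 is replaced by 1). */
    for (long c = 0; c <= UCHAR_MAX; c++)
        d_bm[c] = m;
    for (long t = 0; t < m; t++)
        d_bm[(unsigned char)p[t]] = (m - 1 - t) ? (m - 1 - t) : 1;

    /* Steps 3 and 13: i is the position of the pattern in the string. */
    for (long i = 0; i + m <= n; ) {
        long j_kmp = 0, j_bm = m - 1;          /* Step 4 */
        int mismatch = 0;

        for (;;) {
            /* Step 5: compare from both ends of the pattern. */
            if (s[i + j_kmp] != p[j_kmp] || s[i + j_bm] != p[j_bm]) {
                mismatch = 1;
                break;
            }
            if (j_kmp >= j_bm)                 /* Steps 6-8: full match */
                break;
            j_kmp++;                           /* Steps 9-10 */
            j_bm--;
        }

        if (!mismatch) {
            result = i;
            break;
        }

        /* Step 11: the larger of the two shifts. */
        long shift_kmp = j_kmp - d_kmp[j_kmp];
        long shift_bm = d_bm[(unsigned char)s[i + m - 1]];
        i += shift_kmp > shift_bm ? shift_kmp : shift_bm;   /* Step 12 */
    }

    free(d_kmp);
    return result;
}
```

On the string "bar is full of barbarians" each of the three mismatches is detected in a single double comparison, the Boyer-Moore shifts 9, 3 and 3 are chosen over the Knuth-Morris-Pratt shift of one, and the pattern is found at position 15.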

Results and discussion
The combined string searching algorithm based upon the Boyer-Moore and Knuth-Morris-Pratt string searching algorithms was implemented in the computer programme InfoSearch. This programme makes it possible to analyse the work of the Boyer-Moore algorithm, the Knuth-Morris-Pratt algorithm and the combined algorithm. The interface of the programme is shown in Figure 10. The results of the considered string searching algorithms can be illustrated using The Tragedy of Hamlet, Prince of Denmark by William Shakespeare. Table 1 shows the results of the algorithms when searching for the words listed in its first column.