The Analysis of KMP Algorithm and its Optimization

The Knuth-Morris-Pratt (KMP) algorithm is one of the essential string matching algorithms. This paper presents and discusses the KMP algorithm and some of its optimizations. By building and searching a letter numbered table and a new data structure called the last-identical array, it proposes a new string searching algorithm, L-I-KMP, based on the KMP algorithm. In practice, the new method is faster than KMP in specific situations.


Original KMP algorithm
As Alzoabi et al. (2013) noted in their article, the KMP algorithm has been "considered as the first linear time string-matching algorithm with a serial cost of O(m+n)" [6]. The KMP algorithm traverses the given pattern string from head to tail. For each prefix of the pattern, it finds the longest proper prefix that is also a suffix of that prefix and records the length of this common part in a "Failure Table", which has the same length as the pattern [7]. Each letter of the pattern therefore has a corresponding number in the Failure Table. Comparison then starts from the first letters of the pattern P and the text T. Let N1 denote the number of characters matched so far, updated as the comparison proceeds, and let N2 denote the Failure Table entry of the last matched character. When a mismatch occurs at some letter of P, the number of positions that P must move to the right is N1 - N2. The "next" array can be used to provide the same functionality.
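The description above can be sketched in Python as follows. This is a minimal illustration rather than the paper's own code: the function names `failure_table` and `kmp_search` are chosen here for convenience, and the variable `q` plays the role of N1.

```python
def failure_table(pattern):
    """table[i] is the length of the longest proper prefix of
    pattern[:i + 1] that is also a suffix of it."""
    table = [0] * len(pattern)
    k = 0  # length of the prefix currently matched against a suffix
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]  # fall back to a shorter candidate prefix
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def kmp_search(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    if not pattern:
        return []
    table = failure_table(pattern)
    hits = []
    q = 0  # N1: number of pattern characters matched so far
    for i, ch in enumerate(text):
        while q > 0 and ch != pattern[q]:
            # N2 = table[q - 1]; keeping N2 characters matched is the
            # same as sliding the pattern right by N1 - N2 positions.
            q = table[q - 1]
        if ch == pattern[q]:
            q += 1
        if q == len(pattern):
            hits.append(i - q + 1)
            q = table[q - 1]  # continue searching for further matches
    return hits
```

For the pattern "acabb" used later in this paper, `failure_table("acabb")` returns `[0, 0, 1, 0, 0]`, and `kmp_search("abcabcabc", "abc")` returns `[0, 3, 6]`.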

Existing optimization of KMP algorithm
The "Failure Table" mentioned above is also called the PMT (Partial Match Table). For programming convenience, however, it is generally not used directly. In practice, the PMT array is shifted one position to the right and the number -1 is placed in the first position. We call the new array "the next array". A number of optimizations start from the next array; most of them try to find the largest jump that the pattern can make.
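As a small sketch of this shift, assuming the PMT is held in a Python list, the next array can be derived from the PMT or built directly from the pattern. The names `next_from_pmt` and `build_next` are illustrative only.

```python
def next_from_pmt(pmt):
    """Shift the PMT one position to the right and put -1 in front."""
    if not pmt:
        return []
    return [-1] + pmt[:-1]


def build_next(pattern):
    """Build the next array directly, without materialising the PMT."""
    nxt = [-1] * len(pattern)
    i, k = 0, -1
    while i < len(pattern) - 1:
        if k == -1 or pattern[i] == pattern[k]:
            i += 1
            k += 1
            nxt[i] = k  # longest border length of pattern[:i]
        else:
            k = nxt[k]  # fall back to a shorter border
    return nxt
```

For the pattern "acabb", whose PMT is `[0, 0, 1, 0, 0]`, both functions give `[-1, 0, 0, 1, 0]`. The leading -1 lets the matching loop treat "no characters matched" as an ordinary table entry instead of a special case.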
Professor Robert S. Boyer and J Strother Moore presented the BM algorithm in 1977. It can be regarded as an improvement on the KMP algorithm, since it adopts part of the idea of KMP and is more efficient than KMP in practice [8]. Some scholars consider it the most efficient string matching algorithm in common applications [9]. Unlike KMP, the BM algorithm compares characters from the right end of the pattern towards the left, although the pattern and the text are left-aligned at first. When a comparison fails, two rules determine how far to the right P needs to slide: "the bad character rule" and "the good suffix rule" [10]. The slide distance is max{distance given by "the bad character rule", distance given by "the good suffix rule"}. The Sunday algorithm is also a fast matching algorithm in practice [11]. During matching, it does not depend on the comparison direction but tries to skip as many characters as it can on a mismatch; it also relies on a precomputed shift table, much like the BM algorithm. The worst-case time complexity of the Sunday algorithm is O(mn) [12].
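To make the skipping idea concrete, here is a minimal sketch of the Sunday algorithm in its commonly described form, where the shift is decided by the text character just past the current window. The function name `sunday_search` and the dictionary-based shift table are choices made for this illustration.

```python
def sunday_search(text, pattern):
    """Return the index of the first occurrence of pattern, or -1."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return -1
    # shift[c]: distance from the rightmost occurrence of c in the
    # pattern to the position one past the end of the pattern.
    shift = {c: m - i for i, c in enumerate(pattern)}
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            return pos
        if pos + m >= n:
            break  # no character follows the window
        # A character absent from the pattern lets the window jump
        # completely past it, i.e. by m + 1 positions.
        pos += shift.get(text[pos + m], m + 1)
    return -1
```

For example, `sunday_search("hello world", "world")` returns 6 after a single jump of six positions, because the space following "hello" does not occur in the pattern.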
The shift-add approach is based on finite automata theory and the finiteness of the alphabet, combining ideas from the KMP and BM algorithms [13]. This algorithm uses a vector of m states, where m is the length of the pattern. $S_i^j$ ($1 \le i \le m$) denotes the set of states after comparing the j-th letter of the pattern, and the value of $S_i^j$ is the number of mismatching characters in the corresponding positions; if $S_m^j = 0$, the pattern and the text match. The eKMP algorithm was proposed to enhance the efficiency of the KMP algorithm and its sophistication in detecting disease DNA sequences, which it does by continually updating the window size [14].
In 1987, Rabin and Karp presented the idea of hashing the pattern string and comparing its hash value with that of each text substring [15]. It can be used to detect plagiarism, as it can handle multiple-pattern matching. Even though this approach has no theoretical advantage over the BF (brute-force) algorithm in the worst case, its complexity can be comparatively low in practice, especially when an efficient hash function is used. The KMP algorithm is only suitable for one-dimensional string matching, whereas the Rabin-Karp algorithm is well suited to higher-dimensional matching problems. The KMP+Rabin-Karp algorithm, which combines the two algorithms to solve the two-dimensional string matching problem, was proposed to achieve better efficiency [2].
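The hashing idea can be sketched for the one-dimensional, single-pattern case as below. The polynomial rolling hash, the parameters `base` and `mod`, and the function name `rabin_karp` are assumptions of this illustration, not details taken from [15].

```python
def rabin_karp(text, pattern, base=256, mod=1_000_000_007):
    """Return the start index of every occurrence of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)  # weight of the window's first character
    p_hash = w_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        w_hash = (w_hash * base + ord(text[i])) % mod
    hits = []
    for s in range(n - m + 1):
        # Equal hashes may be a collision, so confirm character by character.
        if p_hash == w_hash and text[s:s + m] == pattern:
            hits.append(s)
        if s < n - m:
            # Roll the window: drop text[s], append text[s + m].
            w_hash = ((w_hash - ord(text[s]) * high) * base
                      + ord(text[s + m])) % mod
    return hits
```

With a good hash, most windows are rejected by a single integer comparison, which is why the practical cost can stay low even though every hash collision still requires a full comparison.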

Motivation of L-I-KMP algorithm
Having surveyed several categories of optimized KMP algorithms, the author proposes another possible string matching algorithm that uses the idea of the KMP algorithm to achieve a large move at one time. The new method draws on the Sunday algorithm and makes use of a letter numbered table, and it uses a new data structure called the last-identical array to decide the length of the move in some matching situations. As in the KMP algorithm, matching continues by referring to the data in the table when a mismatch occurs.

Preprocessing
This method requires preprocessing. Before matching starts, the pattern is analyzed: a letter table records each letter and its positions in the pattern, so that the next occurrence of the same letter can be looked up directly in order of appearance. Positions are listed starting from the tail of the pattern. This idea is illustrated in more detail in Figure 1. In Figure 1.a), the given pattern is "acabb", and the target text is listed in the picture. Preprocessing scans the pattern from right to left and records each new character and its position; this process is displayed in Figure 1.b).

Figure 1. Example of L-I-KMP's preprocessing: a) a pair of strings given; b) making a letter table when preprocessing.
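A possible sketch of this preprocessing step is shown below, assuming the letter table is stored as a dictionary that maps each letter to its positions listed from the tail of the pattern. The name `build_letter_table` and the dictionary representation are assumptions of this sketch; the paper only specifies the table's contents.

```python
def build_letter_table(pattern):
    """Map each letter to its positions in the pattern, rightmost first."""
    table = {}
    # Scan from the tail of the pattern so that, for every letter,
    # the positions are recorded in decreasing order.
    for pos in range(len(pattern) - 1, -1, -1):
        table.setdefault(pattern[pos], []).append(pos)
    return table
```

For the pattern "acabb", `build_letter_table("acabb")` returns `{'b': [4, 3], 'a': [2, 0], 'c': [1]}`.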

The algorithm flow
This method builds on the Sunday algorithm after the first matching failure but is not exactly the same. In the Sunday algorithm, the step after a mismatch is to check whether the character following the mismatched character occurs in the pattern; if it does not, the whole pattern can skip that character, moving by the number of matched letters plus one. In L-I-KMP, by contrast, attention goes to the text character immediately after the last character aligned with the pattern, called A1 here. If A1 is not in the letter table, the pattern moves by its own length plus one. Otherwise, the letter table is consulted for the first number listed after A1; this number, which the author names k, is the position of an identical character in the pattern, called A2. The two characters are aligned, and comparison starts from the left of the pattern. When a mismatch occurs, the table is consulted again for the next number after k, and that position of the pattern is aligned with A1. The same logic continues until no number remains for A1. At that point, the character after the last aligned character becomes the new A1.
Figure 2 illustrates the algorithm flow using the same example as Figure 1. In Figure 2.a), after preprocessing, the first character matches. However, b ≠ c, so pattern[1] ≠ text[1]. The next character to consider is the sixth letter, text[5] = f; since f does not occur in the pattern, the pattern can jump over "f", giving the result in Figure 2.b). There, g ≠ a, so text[6] ≠ pattern[0], and text[6 + length of pattern] = text[11] = m is checked; m does not appear in the letter table either, so the whole pattern skips m.
In Figure 2.c), the pattern and the text do not match at pattern[1], and the next character of the text, text[12 + length of pattern] = text[17] = b, is in the letter table. The numbers that follow it are 4 and 3, in that order. The first action is to shift the pattern so that pattern[4] aligns with text[17] and then to begin comparing from the left of the pattern. Here the whole pattern matches, so there is no need to use the number 3 from the letter table.
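The flow described in this section can be sketched as follows. This is one reading of the prose above, not the author's reference implementation: the function name `li_kmp_find`, the choice to return only the first match, and the use of `str.startswith` for the left-to-right comparison are all assumptions made for this illustration.

```python
def li_kmp_find(text, pattern):
    """Return the index of the first occurrence of pattern, or -1."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return -1
    # Letter table: positions of each letter, rightmost first.
    table = {}
    for pos in range(m - 1, -1, -1):
        table.setdefault(pattern[pos], []).append(pos)

    start = 0
    if text.startswith(pattern, start):
        return start
    while True:
        a = start + m  # index of A1, just past the current alignment
        if a >= n:
            return -1
        if text[a] in table:
            # Try each position k of A1 in the pattern, rightmost first,
            # aligning pattern[k] with A1.
            candidates = [a - k for k in table[text[a]]]
        else:
            candidates = [a + 1]  # A1 is not in the pattern: skip it
        for start in candidates:
            if start + m > n:
                return -1  # the pattern no longer fits in the text
            if text.startswith(pattern, start):
                return start
        # No number is left for A1; the loop picks a new A1 after the
        # last alignment tried.
```

The text in the figures is not reproduced here, so the following text is a hypothetical one chosen only to agree with the characters quoted above (text[5] = f, text[6] = g, text[11] = m, text[17] = b): `li_kmp_find("abxyzfghijkmaacabb", "acabb")` returns 13, following the same three jumps as Figure 2.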

Optimization for coding convenience
For coding convenience during matching, a last-identical array is introduced in Figure 3; it is modeled on the format of the next array in the KMP algorithm and built from the data in the letter numbered table. In the last-identical array, each character of the pattern not only corresponds to its position number but also points to the previous position of the same letter; if the letter appears for the first time from left to right, it points to NULL. In Figure 3, given the pattern "acabb", whose position array in order is "01234", the last-identical array is "NULL NULL 0 NULL 3".
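A minimal sketch of building this array is given below, with Python's `None` standing in for NULL. The extra `rightmost` dictionary, which records where each letter's chain begins, is an assumption added here so that the chain can be followed; the names `build_last_identical` and `positions_from_right` are likewise illustrative.

```python
def build_last_identical(pattern):
    """Return (prev, rightmost): prev[i] is the previous position of
    pattern[i] in the pattern, or None if there is none; rightmost maps
    each letter to its last position."""
    rightmost = {}
    prev = []
    for i, ch in enumerate(pattern):
        prev.append(rightmost.get(ch))  # None plays the role of NULL
        rightmost[ch] = i
    return prev, rightmost


def positions_from_right(ch, prev, rightmost):
    """Follow the chain of identical letters from right to left."""
    out = []
    k = rightmost.get(ch)
    while k is not None:
        out.append(k)
        k = prev[k]
    return out
```

For "acabb", `prev` is `[None, None, 0, None, 3]`, and following the chain for "b" yields `[4, 3]`, the same sequence that the letter table lists for that letter.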

Performance comparison of algorithms
The number of attempts and the number of letters matched during the whole matching procedure are two essential measures for analyzing the capability of a matching algorithm [16]. The pseudo-code of the KMP algorithm is given in Figure 4. When using the KMP algorithm, step (1) preprocesses the Prefix-Table of P, that is, calculates the "Failure Table" or the "next array". Step (2) initializes q, the number of characters matched, to q = 0. Step (3) starts scanning the text from left to right. In steps (4), (5) and (6), the algorithm performs the matching: if the two characters are equal, the number of matched letters, q, is increased by one; if the two do not match, q is assigned a new value: q = P[q]. The pseudo-code of the new algorithm is displayed in Figure 5. The matching information displayed in Table 1 was obtained from repeated runs with the same randomly generated data. According to the figures in the table, the L-I-KMP algorithm performs well in practice when the pattern contains only a few types of letters, and the same letters