Linear time exact pattern matching

Yesterday, Storm and I had exams in String Algorithms. During the exams we came up with a simple linear time exact pattern matching algorithm that is very similar to the Knuth–Morris–Pratt algorithm but a little bit simpler.

The Knuth-Morris-Pratt algorithm is based on border arrays. A border of a string \(x\) is any substring that is both a prefix and a postfix of \(x\). The border array for \(x\) is an array that for each index \(i\) holds the length of the longest border for prefix \(x[1\ldots i]\). You can compute the border array for a string “key” using the algorithm below:

    unsigned long m = strlen(key);
    unsigned long ba[m];
    ba[0] = 0;
    for (unsigned long i = 1; i < m; ++i) {
        unsigned long b = ba[i-1];
        while (b > 0 && key[i] != key[b])
            b = ba[b-1];
        ba[i] = (key[i] == key[b]) ? b + 1 : 0;
    }

There is a double loop, but you can show that the total number of iterations in the inner loop never exceeds the total number of iterations of the other loop, so the algorithm computes the border array in time \(O(m)\) where \(m\) is the length of the string.

The Knuth-Morris-Pratt algorithm computes the border array for the key it searches for and then has a loop through the string it searches in where it uses the border array to step forward when it sees a mismatch. You can, however, also search in linear time just using border arrays. This is something Storm has used as an exercises in our class for years now. The idea is this, if you want to search for pattern \(p\) in string \(x\), then you can construct the string \(p\#x\) where # is a sentinel symbol that is not found in \(p\) and \(x\). If you build the border array for this string, you can find all occurrences of \(p\) in \(x\) by finding the indices \(i>m\) where the border array contains \(m\) and then report \(i-m+1\).

What we realised yesterday while asking questions at the exam was that you never actually use the border array beyond the first \(m\) entries in this algorithm. You just need to know how long a border would be at any point, but it can never grow beyond \(m\) by construction, so there is no need to store it. Instead, we can just build the border array for the key we search for and then pretend that we build the border array for the rest, but really just keep track of the length of it.

A complete search function based on this can be written in a few lines of C:

static void ba_search(char * key, char * buffer)
{
    unsigned long n = strlen(buffer);
    unsigned long m = strlen(key);
    unsigned long ba[m];
    ba[0] = 0;
    for (unsigned long i = 1; i < m; ++i) {
        unsigned long b = ba[i-1];
        while (b > 0 && key[i] != key[b])
            b = ba[b-1];
        ba[i] = (key[i] == key[b]) ? b + 1 : 0;
    }
    unsigned long b = 0;
    for (unsigned long i = 0; i < n; ++i) {
        while (b > 0 && buffer[i] != key[b])
            b = ba[b-1];
        b = (buffer[i] == key[b]) ? b + 1 : 0;
        if (b == m)
            printf("%lu\n", i - m + 1);
    }
}

The algorithm is essentially just Knuth-Morris-Pratt. It uses the border array for the key in exactly the same way. It is just slightly simpler because it doesn’t consider the search part of the algorithm as different from the preprocessing. Both the loops in this function are essentially doing the same thing—the only reason they are different is that one builds the border array and only looks at the key while the other only keeps track of border lengths and looks at the string we search in.

Author: Thomas Mailund

My name is Thomas Mailund and I am a research associate professor at the Bioinformatics Research Center, Uni Aarhus. Before this I did a postdoc at the Dept of Statistics, Uni Oxford, and got my PhD from the Dept of Computer Science, Uni Aarhus.

Leave a Reply