Damn simple suffix array construction

As you may or may not know, suffix arrays are the key data structure for many string algorithms. Well, the key data structure is suffix trees but those take up way too much memory so we use suffix arrays instead because they can do the same things (with a little extra bookkeeping) while taking up a lot less memory.

We just finished our class on string algorithms this week and Storm (my co-teacher) and I were talking about simple suffix array constructions. There are many different algorithms for constructing suffix arrays. There are several algorithms for constructing suffix arrays in linear time — although in practise these are often slower than algorithms with worse big-O complexities — but we were just talking about very easy implementations of them.

In C, where strings are just pointers to to null-terminated byte buffers you can get a representation of all suffixes of a string — so all substrings that ends at the same point as the original string — by having a pointers into that string. So it is easy to represent the set of all suffixes as an array of pointers.

The suffix array is just an array that contains the rank (the order) of all suffixes, so if you can just soft such an array you have constructed the suffix array.

In C you can do that just by sorting the array of pointers.

The worst-case complexity of this is pretty bad — you have an O(n log n) complexity for sorting and each comparison could in principle take time n as well since you have to compare strings that on average n in length — but in practise you can compare strings a lot faster since they do not typically share long prefixes (so you can terminate the comparison quickly) and the n log n complexity doesn’t matter much when you are using an optimised search algorithm.

So you should be able to sort the suffixes in a few lines of code using qsort and strcmp.

You need a wrapper to make qsort work with strcmp, but other than that, it is really that simple.

Below is my implementation of that. It is a complete program that takes a string on the command line and outputs the suffix array for it.

I also made a version that reads a string from a file and on my laptop I can build a suffix array for a file that is around a hundred megabases in a minut or so (for a human chromosome) so it isn’t that bad in performance.

Here is my simple implementation:

#include <stdio .h>
#include <stdlib .h>
#include <string .h>

// Wrapper needed to make strcmp work with qsort
int strcmp_wrapper(char **x, char **y)
{
    return strcmp(*x, *y);
}

int main(int argc, const char * argv[])
{
    if (argc != 2) {
        printf("Usage: %s string\n", argv[0]);
        return 1;
    }

    const char *x = argv[1];
    size_t n = strlen(x);
    char *suffixes[n];

    // Collect all the suffixes in an array
    for (int i = 0; i < n; i++)
        suffixes[i] = (char*)x + i;
    printf("All suffixes:\n");
    for (int i = 0; i < n; i++)
        printf("Suffix %d: %s\n", i, suffixes[i]);
    printf("\n");

    // Sort the suffixes
    qsort(suffixes, n, sizeof(char<em>), (int(</em>)(const void*, const void*))&strcmp_wrapper);

    printf("Sorted suffixes:\n");
    for (int i = 0; i < n; i++)
        printf("Suffix %d: %s\n", i, suffixes[i]);
    printf("\n");

    // The actual suffix array should be indices and not the raw pointers so we need to do this
    int suffix_array[n];
    for (int i = 0; i < n; i++)
        suffix_array[i] = (int)((const char *)suffixes[i] - x);

    printf("Suffix array:\n");
    for (int i = 0; i < n; i++)
        printf("SA[%d] = %d\n", i, suffix_array[i]);
    printf("\n");
 
    return 0;
}