No new ideas

That algorithm I wrote about last week, for exact pattern matching using border arrays? Well, it turns out that Gusfield already described that in 1996 in his book Algorithms on Strings, Trees, and Sequences.

Both Storm and I have definitely read this at some point and must then have forgotten. It is hard to say if we then reinvented the idea or just suffer from cryptomnesia.

Linear time exact pattern matching

Yesterday, Storm and I had exams in String Algorithms. During the exams we came up with a simple linear time exact pattern matching algorithm that is very similar to the Knuth–Morris–Pratt algorithm but a little bit simpler.

The Knuth-Morris-Pratt algorithm is based on border arrays. A border of a string \(x\) is any substring that is both a prefix and a postfix of \(x\). The border array for \(x\) is an array that for each index \(i\) holds the length of the longest border for prefix \(x[1\ldots i]\). You can compute the border array for a string “key” using the algorithm below:

    unsigned long m = strlen(key);
    unsigned long ba[m];
    ba[0] = 0;
    for (unsigned long i = 1; i < m; ++i) {
        unsigned long b = ba[i-1];
        while (b > 0 && key[i] != key[b])
            b = ba[b-1];
        ba[i] = (key[i] == key[b]) ? b + 1 : 0;
    }

There is a double loop, but you can show that the total number of iterations in the inner loop never exceeds the total number of iterations of the other loop, so the algorithm computes the border array in time \(O(m)\) where \(m\) is the length of the string.

The Knuth-Morris-Pratt algorithm computes the border array for the key it searches for and then has a loop through the string it searches in where it uses the border array to step forward when it sees a mismatch. You can, however, also search in linear time just using border arrays. This is something Storm has used as an exercises in our class for years now. The idea is this, if you want to search for pattern \(p\) in string \(x\), then you can construct the string \(p\#x\) where # is a sentinel symbol that is not found in \(p\) and \(x\). If you build the border array for this string, you can find all occurrences of \(p\) in \(x\) by finding the indices \(i>m\) where the border array contains \(m\) and then report \(i-m+1\).

What we realised yesterday while asking questions at the exam was that you never actually use the border array beyond the first \(m\) entries in this algorithm. You just need to know how long a border would be at any point, but it can never grow beyond \(m\) by construction, so there is no need to store it. Instead, we can just build the border array for the key we search for and then pretend that we build the border array for the rest, but really just keep track of the length of it.

A complete search function based on this can be written in a few lines of C:

static void ba_search(char * key, char * buffer)
{
    unsigned long n = strlen(buffer);
    unsigned long m = strlen(key);
    unsigned long ba[m];
    ba[0] = 0;
    for (unsigned long i = 1; i < m; ++i) {
        unsigned long b = ba[i-1];
        while (b > 0 && key[i] != key[b])
            b = ba[b-1];
        ba[i] = (key[i] == key[b]) ? b + 1 : 0;
    }
    unsigned long b = 0;
    for (unsigned long i = 0; i < n; ++i) {
        while (b > 0 && buffer[i] != key[b])
            b = ba[b-1];
        b = (buffer[i] == key[b]) ? b + 1 : 0;
        if (b == m)
            printf("%lu\n", i - m + 1);
    }
}

The algorithm is essentially just Knuth-Morris-Pratt. It uses the border array for the key in exactly the same way. It is just slightly simpler because it doesn’t consider the search part of the algorithm as different from the preprocessing. Both the loops in this function are essentially doing the same thing—the only reason they are different is that one builds the border array and only looks at the key while the other only keeps track of border lengths and looks at the string we search in.

New book: The Beginner’s Guide to GitHub

I finished some lecture notes on how to use git and GitHub yesterday, and since I had them all written up in Markdown I could translate them into a booklet at the click of a button—which I did—and then put the booklet on Amazon.

I had planned to put it up as a free book, but apparently I am not allowed to do that. They have a minimum price of $0.99 for all books, and books that are larger than 10Mb have a minimum price of $2.99. Since my booklet is full of screenshots it makes it just above 10Mb—so no option of giving the book away. At least, not unless I make it available somewhere else, but I don’t really have the infrastructure for that, and Leanpub would cost me $124 if I wanted to give it away there.

I could, however, enrol it in Amazon’s Kindle Select. It gives Amazon exclusive rights to the book but also enables some promotion options—including making the book free for a limited time. I have chosen that, and from tomorrow and five days on you can get The Beginner’s Guide to GitHub for free. I will re-enrol it in the free promotion again afterward, but I don’t know if there is a grace period between promotions. If there is, you just have to be patient. It will be free again later. But why wait, you might as well get it tomorrow. After all, if it is free, you can just delete it if you don’t like it.

This is my first experience with Kindle Select. My other books have been available on Leanpub and iBooks as well as Amazon from day one. I’m curious how this system works. I might try it with future books as well—for now I just need it to make this booklet free.

Github in the classroom

There’s a piece on the GitHub blog about GitHub Classroom. I tried it last year, and I was not really impressed by it. I want to use GitHub in my teaching, but GitHub Classroom is useless, in my opinion, and here is why.

GitHub Classroom

With Classroom you set up a repository that the students can clone. You can set up a number of assignments and then you get URLs that you can give the students.

When they click the URL a new repository is created

But what they get is a clone that is completely decoupled from the repository you set up; worse, it lives in an Organisation space and not in the students’ own list of repositories—which makes the repositories harder for them to find.

Since the new repositories are completely decoupled from the originals, If you make changes to the original repository—for example, you can add some test data or test cases for the students to validate their code after they believe they have solved the problems you give them—these changes are not available for them to merge into their own repositories.

Now, I could see some benefits in being able to amend an assignment by updating a repository. Especially in the—admittedly unlikely—scenario where I have made a mistake in the assignment description or example code. It could also be very useful to be able to add data in a data science projects. Quite often, in real projects, the data you are working on gets updated, which is why it is so important to have automated analysis pipelines.

Just as a learning experience, I would like to be able to tell my students, “right, now you have analysed the data, but I have just changed it, do the analysis again!”. With GitHub Classrooms, I cannot do that.

Even worse, since there is no real connection between the original repository and the students’, you don’t get the review functionality you have for real repositories. You can see a snapshot of what the students have right now, and you can, of course, go back and look over their commits, but if you want to go and comment on what they are doing, you have to do it as individual line comments. You cannot combine a number of comments in a single review.

That is probably what annoyed me the most with using Classroom, and why I gave up on it half-way through the class where I used it. The functionality I’m used to on GitHub simply isn’t there with Classroom.

Using real repositories

I think there are two points to using GitHub in teaching: you teach the students skills they will most likely find use for in the future where they are likely to actually use GitHub, and it makes it a lot easier for me to correct assignments.

First point first. Using a crippled version of GitHub for assignments doesn’t show the students how to actually use GitHub. If you just want to show them how to use version control, it is fine, but for GitHub there is so much more you can do. Much of which you can’t do with Classroom.

You can, of course, still use Issues to have todo-lists of tasks and you have them as a discussion forum if they work in distributed groups—which they didn’t in the class where I used Classroom, but which they often do in other classes I teach. They can also use pull-requests within a group, if they want to, but if everyone in the group can push to the repository there is little need for it.

When you actually use GitHub, pull-requests and reviews are part of the usual workflow. It would only make sense to include it in the workflow used in class.

I realise and appreciate that there is a difference between a real project and a classroom setting, but it takes very little in addition to make the full experience part of class compared to just using GitHub as a version control system.

If, instead of using Classroom, you make a repository that the students fork, you have your repository and the students’ repositories connected. You can make changes to the repository and they can pull them. More importantly, if hand-ins are done as pull-requests, they get experience with that, and you can use the review features to correct the hand-ins.

You have the hand-ins as pull-requests; you can go in and compare changes from the forked version or since your last review; and you have a built-in discussion forum where you can discuss the code, either commenting directly on individual lines or give overall feedback.

This is an active document. If they make changes after a review, they become part of the pull-request, so you can keep track of their progress. And you can directly see what they changed since the last review, so you don’t have to read through the entire code base to see what the modified.

If they get stuck, you can even jump to their pull-request branch, fix an error for them, and push it back to them (for example as a pull-request to their own repository), after which they can continue.

Even better, you can set up automatic checks of their hand-ins using e.g. Travis. If you set up a suite of tests, the students get immediate feedback on how they are doing with solving the problem, and you can check at a glance—by the green tick-marks on the pull-requests—whether they have actually solved the problem they are supposed to.

You cannot set up a complete unit-test suite of their code, of course; you can only check that their code passes a specification you have made. You can, however, set up a code coverage check and keep track of how much they are testing their own code. Again, this is a way for yourself to see how they are doing at a glance and a way for them to get immediate feedback for themselves.

codecov.io coverage report
codecov.io coverage report

Setting up tests and coverage statistics is, of course, also something they could set up themselves for their own Classroom repositories, but that would require them to get familiar with those tools in addition to GitHub. At some point they probably should learn those tools, but if you set it up in the original repository they don’t have to before they earn the benefits of them—which might motivate them more to learn the tools later.

coveralls.io coverage report
coveralls.io coverage report

You can even set up automatic code reviews using codacy if the students are using a programming language supported there—unfortunately for my R classes, it doesn’t seem to support R.

By and large, having the students fork a repository seems, to me, a much better approach to using GitHub in the classroom that using GitHub Classroom is.

One remaining issue

The only thing that really bugs me with using GitHub and pull-requests in class is that I cannot easily include the pull-request merging in the workflow. At least, I haven’t figured out a good way of doing it yet.

There is nothing inherently wrong with just having an open pull-request per student group, but I would like to be able to merge these pull-requests as the way of accepting the hand-ins. It would be the right kind of closure to finishing an exercise.

It would also make it a lot easier for me to keep track of which groups have finished an assignment and which are still working on it.

The problem with just merging pull-requests is, of course, that if they all make pull-requests to the same branch, then you can’t merge them all.

One solution I’ve been playing with is to have separate branches per group. If they make pull-requests to separate branches, then these can easily be merged and everything is fine. This just requires that I set up branches for each group before they make pull-requests—and that they actually make pull-requests to the right branches—but it is a solution that would work. It just requires a little more work.

It’s not a problem for me to have to set up the branches for each group. What worries me is that there is just that little extra work involved for the students in making the pull-request, especially if they use a GUI like GitHub Desktop. There, you can make pull-requests very easily, but for the branch solution to work, you would have to remember to pick the right branch to make the pull request to.

Not impossible to remember, but a little extra to keep track of while they are learning to use a new tool.

I don’t know if there is any way to change the “to”-branch of a pull-request. I don’t think so. If there were, that would definitely be the way to go.

Testing an O(n log n) running time

This morning I noticed that this blog had been hit by this search term:

I’m not sure why it hit my blog since I have never written about that (as far as I recall), but I know that my current students have to worry about an $$O(n \log n)$$ running time algorithm for their current project.

They have to implement an algorithm for finding tandem repeats, and the essential part of the algorithm is locating “branching” tandem repeats in time $$O(n \log n)$$.  To validate that they get it right, I ask them to run some profiling and describe how they test this.  Perhaps it is my students googling to figure out how?

Well, depending on how you think about this, it is either very simple or very hard to validate an $$O(n \log n)$$ running time.

What does the big-O notation really mean?

When we have an algorithm that runs in time $$O(n \log n)$$, what we mean is that the worst case running time on input of size $$n$$ is (eventually) dominated by $$\alpha \cdot n \log n$$ for some constant $$\alpha$$. So if we consider the running time for all input of size $$n$$ and take the maximum of that, that is the worst case running time.  Let $$t(n)$$ denote this.  Saying $$t(n)\in O(n\log n)$$ simply means that there exists an $$\alpha$$ such that $$\exists N\,:\,\forall n > N\,:\, t(n) \leq \alpha \cdot n \log n$$.

Sometimes we don’t want to argue about worst case running time. For example sometimes we worry more about the expected running time, then the $$t(n)$$ function would take the expectation rather than the maximum over all input of size $$n$$.  If this is the case, we make it explicit, so whenever we talk about $$O(n \log n)$$ without any qualifier, we are talking worst case running time.

By this definition, $$t(n)\in O(n \log n)$$ doesn’t mean that $$t(n)$$ is some function $$t(n) = \beta n \log n$$ or anything like that.  The big-O notation only puts an upper bound on $$t(n)$$, it doesn’t say that this upper bound is tight or anything.  We have other notations for that, but you don’t see it so much out of algorithmics circles.

Anyway, to simplify matters, I am going to assume for the following that $$n \log n$$ is a tight bound, and by this I mean that $$t(n) = \beta \cdot n \log n + o(n \log n)$$ for some constant $$\beta$$.  Here the little-O notation means that there might be some extra time consumption in $$t(n)$$ but that $$\beta \cdot n \log n – t(n)$$ tends to zero as $$n$$ tends to infinity.

Can you prove a certain running time with experiments?

For an $$O(n \log n)$$ algorithm, we mathematically prove that the worst case running time is dominated by some $$\alpha n \log n$$.  We don’t experiment our way out of it.  But if we wanted to, could we?

Let’s assume for a second that we don’t have to worry about all input of size $$n$$ to compute $$t(n)$$ – and that’s a big assumption in many cases, since many algorithms have different running time depending on what the actual input is, so the running time isn’t really a function of just the size of the input.  If we really did have to worry about this we would need to run our program on all possible input of size $$n$$ for each $$n$$ we want to examine, and the combinatorial complexity of this would soon make it impossible to even attempt.

So let’s assume that the running time only depends on the size of the input. Could we then distinguish between, say $$t(n) = \beta \cdot n \log n$$ and $$t(n) = \beta \cdot n \log n + \delta \cdot n^2$$?

If $$\delta \ll \beta$$ it is pretty hard!

The two are pretty close!  And that is ignoring measurement errors that would muddy the picture further.

Of course, eventually, for some large $$n$$, the $$\delta\cdot n^2$$ term will dominate the $$\beta\cdot n \log n$$ term, but for the range you measure it might not.

With your experiments you are best just validating that the algorithm, mathematically proved to be $$O(n \log n)$$ was implemented correct and appears to be running in time $$\beta \cdot n \log n$$ (but again remember that this is assuming that $$t(n)$$ only depends on $$n$$ and not the actual input and that $$t(n) = \beta \cdot n \log n + o(n \log n)$$ i.e. that the bound is tight.

Validating an n log n running time

So how do you validate that $$t(n) = \beta \cdot n \log n$$?

Well, if $$t(n) = \beta \cdot n \log n$$ then $$t(n)/\log n = \beta \cdot n$$, so just plot $$t(n)/\log n$$ and check if it looks like a line!

Simple, eh?

What about the base of the logarithm?

In the algorithm my students are implementing, the $$\log n$$ in $$O(n \log n)$$ isn’t the natural logarithm.  It rarely is in computer science.  It isn’t even the base-2 logarithm as it often is.  The base actually depends on the kind of input (but let’s keep ignoring that).

What do you do when you don’t know the base of the logarithm and you want to validate an $$O(n \log n)$$ algorithm?  Do you need to divide $$t(n)$$ with the right base log to get a line?

Of course not.  Remember $$\log_b x = \frac{\log_k x}{\log_k b}$$, so if the running time is with a log base $$b$$: $$t(n) = \beta \cdot n \log_b n = \frac{\beta}{\log_k b} \cdot n \log_k n$$, and you plot it with a log base $$k$$, then $$t(n)/\log_k n = \frac{\beta}{\log_k b} \cdot n$$ is still a line.

That is why, when we write the complexity as $$O(n \log n)$$ we don’t bother with the base.  All logs are the same under a big-O.

In practice, of course, the base of the logarithm matters as it affects the constant in front of the $$n \log n$$, but for the big-O notation it doesn’t, and for validating that your program runs in time $$O(n\log n)$$ it doesn’t either.