Posts Tagged ‘HMM’

Detecting Selective Sweeps: A New Approach Based on Hidden Markov Models

Wednesday, September 30th, 2009

Two of my main interests are hidden Markov models and selection.  A paper from this spring, in Genetics, combines the two:

Detecting Selective Sweeps: A New Approach Based on Hidden Markov Models

Boitard, Schlötterer and Futschik

Detecting and localizing selective sweeps on the basis of SNP data has recently received considerable attention. Here we introduce the use of hidden Markov models (HMMs) for the detection of selective sweeps in DNA sequences. Like previously published methods, our HMMs use the site frequency spectrum, and the spatial pattern of diversity along the sequence, to identify selection. In contrast to earlier approaches, our HMMs explicitly model the correlation structure between linked sites. The detection power of our methods, and their accuracy for estimating the selected site location, is similar to that of competing methods for constant size populations. In the case of population bottlenecks, however, our methods frequently showed fewer false positives.

Selective sweeps

Under a simple Wright-Fisher model, a neutral mutation that is just introduced into a population  can slowly increase and decrease in frequency until it is eventually either fixed in the population, which happens with probability \frac{1}{2N_e}, or until it is lost from the population againg, which happens with probability 1-\frac{1}{2N_2} of course.

The expected time from such a mutation is introduced into the population and until it is fixed, if it is lucky to be fixed, is 2N_2 generations.  During this time, the descendant chromosomes of the original mutant chromosome will be subjected to new mutations and to recombinations.

Once this mutation is fixed, everyone in the population will of course share that particular mutation (ignoring back-mutations and such here), but because of recombination nearby sites will not necessarily all be derived from the original mutation chromosome.  Close to the mutation site — where few recombinations will have broken up the sequence — most chromosomes will be derived from the mutation chromosome and as we move away from the mutation site fewer chromosomes will be derived from that original chromosome.

Now, if the mutation introduced has a selective advantage, essentially the same process will play out.  In each generation there is a slightly higher chance that this mutation will have off-springs, but that is essentially the only difference.

What this means is that initially there is still a very good chance that the mutation will be lost — even with slightly better odds accidents do happen — but once the mutation has reached a reasonable frequency it is almost guaranteed to reach fixation — unless a lot of accidents happen.

Once the frequency of the site under selection is high enough it will very quickly reach fixation.  The expected time it takes depends on the selection strength but unless the selective advantage is very small it will reach fixation a lot faster than if it was neutral.  Think logarithmic time in the size of the population compared to linear time.

Since it reaches fixation much faster than a neutral mutation, fewer mutations and fewer recombinations will have time to occur, so a much wider region around the mutation site will be shared by all descendant chromosomes.  Combined, this means that for a selected site you expect a wide region with a more recent shared ancestor than you would expect at a neutral site, a phenomena called a selective sweep.

Site frequency spectra

Now, from the population genetics model you can work out — putting your thinking hat on or just simulate — the expected distribution of derived and ancestral alleles: the site frequency spectrum.  This will be different from neutral alleles and selected alleles because of the shorter time back to the common ancestor for the selected sites.  The shorter site means that there is a general reduction in polymorphism near a selected site, and derived alleles that appeared on chromosomes with the beneficial mutation will be at a higher frequency than they would be if they weren’t “hitchhiking” on the selection of the beneficial mutation.

The pattern is a bit complicated by recombination, since you need to take into account that the further away from the selected site you look, the weaker the hitchhiking effect will be; a new mutation can only hitchhike as long as it is linked to the selected site, and recombinations break that link.

Anyway, the different spectra of derived and ancestral alleles can be used to detect selective sweeps.  Two methods that exploit this, that is relevant for this post, are Kim and Stephan (2002) and Nielsen et al. (2005).

Of course, selection is not the only thing that can mess up the site frequency spectrum and make it different from the expected neutral distribution.  Demographic effects like expending populations and bottlenecks can look very similar to selection effects, so we cannot absolutely rule out neutrality if we see a deviation from the expected spectrum.  Still, the site frequency spectra of neutrality versus selection can be used for scanning for selection.

Detecting sweeps in a hidden Markov model

The new result in the Genetics paper is a hidden Markov model that uses site frequency spectra to scan for selective sweeps.

Using an HMM means that the model can capture spatial patterns along a genome and capture transitions from “neutral” regions — where no sweep has occurred or is occurring — from “selected” regions — where a sweep occurred or is occurring.  So you don’t have to assume that a locus you are looking at is either a neutral region or a selected region and you don’t have to fiddle around with sliding windows to scan a genome, you explicitly capture the changing patters.

One of the nice properties of HMMs for genomic scans and the reason I love them so much.

The model Boitard et al. develop is quite simple.  They have three states: a neutral state, a selected state, and an intermediate used to capture sites that are slightly caught up in the hitchhiking but not close enough to a selected site to get the full effect.

The transition matrix has a single parameter, p, that is the probability that a neutral or selected site switches to the intermediate state (and the intermediate state switches to those two with equal probability set to p/2).

T=\begin{pmatrix}1-p&p&0\\ p/2&1-p&p/2\\ 0&p&1-p\end{pmatrix}

This of course has the unfortunate effect that the prior distribution (stationary distribution) of the chain will give you 25% chance of a site being neutral, 25% chance of it being selected and 50% chance of being intermediate, which doesn’t really match my expectation of the amount of selection in, say, a human genome. Also, the (prior) expected length of a sweeped region is the same as a neutral region which also does not match my intuition.  With enough data, though, the likelihood should overrule the prior so perhaps it is not too much of a worry…

The emissions of the model are frequencies of derived alleles, so for each site it will emit a frequency that depends on the state.  This is where they capture the different expected frequencies depending on whether a site is neutral or selected.

They use the Kim and Stephan’s and Nielsen et al. methods for this, to develop three variations of HMMs: HMMA, using Kim and Stephan, HMMB using Nielsen et al. and HMMB-SEQ, that also uses Nielsen et al. but only considers segregating sites.  The latter is only for comparison purposes and of course ignores a lot of the information in the data, since the amount of non-segregating sites reflects the general level of polymorphism in a region which again is dependent on the depth of the local genealogy and will be affected by selection.

They use simulations under neutrality to fix the parameter p so they get a 5% false positive rate, and then use the models to scan for sweeps.

They get an okay power for detecting sweeps, but compared to the previous methods they don’t get that much since they did pretty good as well:

Table 1Where they refer to this table in the paper they say they have a higher power, but compared to the CLsw column, the Kim and Stephan’s method, they do not.  After all, it is difficult to beat a power of 1.

They do, however, appear to be more robust to bottlenecks where the two other methods have very high false positive rates:

Table 5


Boitard, S., Schlotterer, C., & Futschik, A. (2009). Detecting Selective Sweeps: A New Approach Based on Hidden Markov Models Genetics, 181 (4), 1567-1578 DOI: 10.1534/genetics.108.100032
273-307=-34

HMMoC and HMMConverter

Friday, September 18th, 2009

I just want to say a few words about a short paper I read last week, and a paper that is a few years old now but related to it.

The first is out in advanced access in Nucleic Acids Research:

HMMConverter 1.0: a toolbox for hidden Markov models

Lam and Meyer

Hidden Markov models (HMMs) and their variants are widely used in Bioinformatics applications that analyze and compare biological sequences. Designing a novel application requires the insight of a human expert to define the model’s architecture. The implementation of prediction algorithms and algorithms to train the model’s parameters, however, can be a time-consuming and error-prone task. We here present HMMCONVERTER, a software package for setting up probabilistic HMMs, pair-HMMs as well as generalized HMMsand pair-HMMs. The user defines the model itself and the algorithms to be used via an XML file which is then directly translated into efficient C++ code. The software package provides linear-memory prediction algorithms, such as the Hirschberg algorithm, banding and the integration of prior probabilities and is the first to present computationally efficient linear-memory algorithms for automatic parameter training. Users of HMMCONVERTER canthus set up complex applications with a minimum of effort and also perform parameter training and data analyses for large data sets.

the other was published in Bioinformatics in 2007:

HMMoC – a compiler for hidden Markov models

Lunter

Hidden Markov models are widely applied within computational biology. The large data sets and complex models involved demand optimized implementations, while efficient exploration of model space requires rapid prototyping. These requirements are not met by existing solutions, and hand-coding is time-consuming and error-prone. Here, I present a compiler that takes over the mechanical process of implementing HMM algorithms, by translating high-level XML descriptions into efficient C++ implementations. The compiler is highly customizable, produces efficient and bug-free code, and includes several optimizations.

Both papers describe compilers that generate C++ implementations of hidden Markov model algorithms from XML specifications, and really they are very similar.

The basic HMM algorithms are quite straightforward to implement, but if you want more complex models such as pair-HMMs or generalized HMMs there is a tad more complications to deal with, and if you need to optimize the algorithms in either runtime or memory usage there are some more complex algorithms you can use such as “banding” – implemented in both HMMoC and HMMConverter – that risk giving sub-optimal results but at a much reduced running time and memory consumption, or the Hirschberg algorithm – only implemented in HMMConverter as far as I can see – that exchanges a doubling in running time for a much reduced memory consumption.

Implementing such extra algorithms is not conceptually hard, but can be quite tedious and error prone, so it makes good sense to have code generators building the algorithms for you.  That is exactly what these tools do.

At a bird’s eye view, the tools are very similar.  You specify the HMM in an XML file (a specification language that I personally don’t like that much, but that is of course very subjective) and the tools then generate the algorithms you ask them to, output as C++ code.

HMMoC provides a number of handles for you to add your own C++ code to the generated code; I am not sure if HMMConverter does the same, but on the other hand HMMConverter provides handles for various constraints on the parameters so it might be easier to re-parameterize models made with that.

Another cool feature unique to HMMConverter is priors on sequence annotation.  You can provide an annotation to the input sequence(s) that is then incorporated in the emission probabilities.  The prior is really on hidden states, but incorporating them into the emission probabilities has exactly the effect you want from them: they weight the posterior probabilities of the hidden states along the input.

To deal with numerical issues, HMMConverter works in log-space while HMMoC uses something called “extended-exponent real numbers”.  Working in log-space can be really slow for the Forward and Backward algorithms, since you have to switch in and out of log-space to deal with sums of probabilities (the Viterbi algorithm doesn’t have this problem, so there the log-space solution is pretty fast).

Unfortunately, there isn’t any comparison between the execution times of algorithms generated with the two tools in the new paper, so I don’t know how much this matters.  In the HMM library I am developing with Andreas we found that the log-solution was very slow, though, and therefore we use a re-scaling approach instead.

I would love to see a comparison of the runtime efficiency between the approaches, but just not quite enough to go and do it myself right now…

  • Lam, T., & Meyer, I. (2009). HMMCONVERTER 1.0: a toolbox for hidden Markov models Nucleic Acids Research DOI: 10.1093/nar/gkp662
  • Lunter, G. (2007). HMMoC a compiler for hidden Markov models Bioinformatics, 23 (18), 2485-2487 DOI: 10.1093/bioinformatics/btm350

261-289=-28

Multi-core HMMs

Thursday, June 11th, 2009

I’m working on a project where it looks like I could end up with hidden Markov models with hundreds if not thousands of states, and with transition matrices where all entries are non-zero.  Since, for such models, the running time for just calculating the likelihood (not even optimising it) is O(m\cdot n^2) where m is the length of the input and n is the number of states, this could be a problem.  Especially since I’m working on whole genome data, so m can get pretty large as well.

It is part of our CoalHMM work, but I’ll tell you more about that later.  For this post, the actual application isn’t all that important, only the size of the model.

So, anyway, needless to say I need a fast implementation if this is ever going to work.

HMM algorithms

The basic hidden Markov model algorithms are essentially just filling in dynamic programming tables.

The “Forward” algorithm, for example, looks something like

for (int i = 0; i < m; ++i) {
    for (int j = 0; j < n; ++j) {
        double sum = 0.0;
        for (int k = 0; k < n; ++k) {
            sum += T[k,j] * E[j,x[i]];
        }
        F[i,j] = sum;
    }
}

There’s a few more details in a real implementation, ’cause this version will have underflows in a few steps, but this is the essential idea.

SIMD speedups

We have developed a small library with these algorithms here at BiRC (mainly implemented by one of my student programmers, Andreas) that speeds up the algorithms using single instruction, multiple data (SIMD) instructions.

The idea here is that we can calculate the sums as above by adding several entries in parallel.  For doubles we can handle two numbers in parallel, for floats we can handle four, so ideally we can speed up the algorithms by a factor of two or four.

We don’t exactly achieve that – there is some overhead involved – but it is not too bad.  The figure below compares the running time with and without SIMD for varying number of states, n, with a fixed input length m=10,000.

The number type – float or double – doesn’t matter for the plain algorithm since the same number of operations is used, so I’ve only plotted the time for doubles.

This is all good and well, but with really large HMMs I think we will need even more of a boost.

Multi-core speedups

It is not hard to get your hands on a multi-core machine these days – if anything it is hard to get a single core machine.  Not massively multi-core machines, although I expect to see those in a few years, but at least dual or quad core machines. So I was hoping to get another factor of two or four from running two or four threads.

In general, parallelising the instructions in the program is harder than parallelising the data operations as we do with the SIMD.  For one thing, there is a lot of synchronisation issues when running multiple threads, and this leads to a lot of overhead.

When filling in the dynamic programming table as above, each column in the table depends on the previous column, but each cell in the column is independent from the other cells in the column.  Thus, we should be able to process each cell in a column independently, but synchronise the threads after each column so we ensure that the previous column is always fully processed.

Since the synchronisation isn’t free, we expect that we need a large number of rows in the table to make it worthwhile, but in my project I expect hundreds to thousands of states, so perhaps there is some gain to it.

I decided to try it out, anyway.

Setting up threads and synchronising them and all that used to require some programming effort, but now that we have OpenMP it is very easy to add parallelism to a program.  To parallelise the algorithm above, you simply add a pragma:

for (int i = 0; i < m; ++i) {
#pragma omp parallel for
    for (int j = 0; j < n; ++j) {
        double sum = 0.0;
        for (int k = 0; k < n; ++k) {
            sum += T[k,j] * E[j,x[i]];
        }
        F[i,j] = sum;
    }
}

and by magic the middle loop is parallelised so the body of it is run in different threads.  Setting up the threads, assigning ranges of the sum to the different threads, synchronising them after the loop, etc. is all taken care of by the compiler.

Neat.

The results are plotted below (on my dual core iMac):

As is evident, there is a large overhead in the threaded version, so for anything below a few hundred states it is slowing the algorithm down instead of speeding it up, but eventually the parallelism does kick in.

Well, for the version without SIMD at least, which is the one I’ve plotted above.

For double SIMD, it is just worth it for the large number of states

while for floats it isn’t really worthwhile for this range of number of states

Of course, this is just the very first attempt at parallelising the algorithms, so there might still be some tricks to try.  For one thing, I have a feeling I can get rid of some of the thread overhead by adding a pragma outside the outer loop. I recall seeing something like that before but I don’t remember quite how it worked.

Anyway, I’ll play with it some more later.

162-165=-3

Widespread genomic signatures of natural selection in hominid evolution

Tuesday, May 12th, 2009

Friday last week, PLoS Genetics published a paper I’ve been waiting to read for a few weeks, since I saw a reference to it in a draft of a review paper I got by email (that paper I’ll tell you all about when it comes out).

The PLoS Genetics paper is this:

Widespread Genomic Signatures of Natural Selection in Hominid Evolution

Graham McVicker, David Gordon, Colleen Davis, and Phil Green

Selection acting on genomic functional elements can be detected by its indirect effects on population diversity at linked neutral sites. To illuminate the selective forces that shaped hominid evolution, we analyzed the genomic distributions of human polymorphisms and sequence differences among five primate species relative to the locations of conserved sequence features. Neutral sequence diversity in human and ancestral hominid populations is substantially reduced near such features, resulting in a surprisingly large genome average diversity reduction due to selection of 19–26% on the autosomes and 12–40% on the X chromosome. The overall trends are broadly consistent with “background selection” or hitchhiking in ancestral populations acting to remove deleterious variants. Average selection is much stronger on exonic (both protein-coding and untranslated) conserved features than non-exonic features. Long term selection, rather than complex speciation scenarios, explains the large intragenomic variation in human/chimpanzee divergence. Our analyses reveal a dominant role for selection in shaping genomic diversity and divergence patterns, clarify hominid evolution, and provide a baseline for investigating specific selective events.

The reason I’ve been waiting for the paper is that it concerns something I am very interested in myself, and something we are working on in our CoalHMM group here at BiRC: detecting selection by detecting variation in effective population size along the genome.

Effective population size

Okay, the concept “effective population size” is a strange beast.  It doesn’t really have anything to do with population size, except in an idealised mathematical model, but is a single parameter that incorporates various different measures such as demographics and selection.

There’s a nice introduction to it in this John Hawks post: Did humans face extinction 70,000 years ago?

As described there, one way of looking at the effective population size is to define it from the average coalescence time of two random individuals in a population.  If we look at it that way, it is clear that selection will affect the effective population size.

A site under selection, if it gets fixed, will do so much faster than a site that is neutral.  A neutral site that gets fixed does so (on average) in time linear in the effective population size, while a site under selection does so in logarithmic time (regardless of whether it is positive or negative selection, surprisingly, but of course if it is negative selection the probability of it getting fixed is smaller).

If we consider a site where mutations occur that are selected against, but these are not fixed, we still see a reduction in the time between two random individuals but for a different reason: those ancestors that were selected against do not have descendants in the present population, so the number of possible ancestors of two random individuals is smaller and when we trace their ancestry back in time, they will find a common ancestor faster.

So in any case, if a site is under selection, we expect the mean time back to a common ancestor — the effective population size — to be reduced.

To muddy the waters a little bit: effective population size also affects selection since selection is stronger if the population size is large but that is a complication best left for another day…

Recombination

Recombination has an effect on this as well.

A site under selection will have a smaller effective population size, but so will nearby sites.  The reason for this is that neighbour nucleotides are likely to have the same most recent common ancestor — and thus the same divergence — with this probability depending on the recombination distance between them.

Consequently, we expect the effective population size to decrease as we move towards a site under selection, and increase again as we move away from it.

It is this kind of patter that McVicker et al. analyses in this paper.

Results

First they identify conserved genomic regions.  These are the regions that are probably under selection, since selection is one of the forces that will conserve sequences.

They do this by running a phyoHMM on an alignment of mammals (excluding those they will analyse later on to avoid biasing the results).

They then split the genome into two classes: those nucleotides within the 10% of the genome closest to a conserved region, and the 50% furthest away.  In these two classes they look at the level of polymorphism in humans, the divergence between human and chimp, and the number of informative sites supporting a grouping of human with gorilla — with chimp as an outgroup — and those grouping chimp with gorilla — with human as an outgroup.  The latter are signs of deep coalescence resulting in incomplete lineage sorting, and signs of a large effective population size in the human/chimp ancestor.

For all measures, they find that the effective population size seems to be reduced for the 10% closer to conserved regions compared to those 50% farthest away.

Since the measures are essentially all just measures of conservation, really, that isn’t in itself much of an argument.  All it says is that there is a correlation of conservation-ness along the genome.  To compensate for this, they then normalise with the divergence to macaque and to dog.  If it is just a reduction in substitution rate that is correlated, then normalising this way — assuming that the substitution rate doesn’t change dramatically along the genome and along the phylogeny — will alleviate the effect from just the substitution rate.

After normalising, the signal is still there: the polymorphism and divergence is still reduced close to conserved regions.

Again, this doesn’t prove that selection is the cause of this pattern, but the pattern certainly matches what we would expect to see if it was selection that caused it.  The normalisation should eliminate, or at least reduce, effects that are just caused by the substitution rate, so unless we invoke some more exotic explanation for conservation and the patterns along the genome, selection is a valid conclusion.

(A) Ratios calculated using the 10% of neutral sites which are nearest to and the 50% of neutral sites farthest away from conserved segments or exons. (B) The same ratios as (A) but normalized by human/macaque (H/M) divergence to account for mutation rate variation or undetected sites under purifying selection. The distance to the nearest conserved segment or exon was determined using four different measures: physical distance, pedigree-based recombination distance [26], polymorphism-based finescale recombination distance [25] and the background selection parameter, B. B (described in the main text) is not technically a distance measure but incorporates information about the recombination rate and local density of conserved segments. Autosomal human nucleotide diversity was calculated from gene-centric SeattleSNPs PGA/EGP [20], whole-genome Perlegen [19] data, and HapMap phase II data [67]. Divergence was estimated using autosomal human/chimp (H/C), human/macaque (H/M), or human/dog (H/D) genome sequence data. HG and CG sites (where human and gorilla or chimp and gorilla share a nucleotide that differs from the other three species) were calculated using a smaller set of 5-species autosomal data. Repetitive regions were omitted from the Perlegen and HapMap analyses; additional filtering steps are described in the methods. Whiskers are 95% confidence intervals.

Now that selection is concluded to be a plausible explanation for the pattern, they fit the data to a model that explains the variation by background selection. This model shows that selection is stronger near conserved regions than farther away, consistent with the assumption that the pattern is caused by selection.

Consequences

So what does all this tell us?

For one thing, it tells us that selection is a force we really should keep in mind when analysing genomes.  Yes, yes, we probably already knew that, but the neutrality assumption is so strong in genome analysis that we rarely consider non-neutrality except for the obligatory dN/dS tests on genes.  For anything that is not a gene, we usually analyse the sequences assuming neutrality.  It is a good null model, but completely ignoring selection when analysing genomic sequences should be reconsidered.

I know, I am putting it a bit on an edge here, ’cause people are not just blindly assuming neutrality, but it is a strong null assumption and we really do not like to invoke selection unless there is strong evidence against neutrality.

Another consequence is for sequence divergence.

We estimate species divergence (time of speciation events) from sequence divergence.  More often than not we equate sequence divergence with specises diverergence, but really we shouldn’t.  Even under neutrality this isn’t true, since the coalescence process of sequences is such that the sequences are further apart than the species, but for neutrality at least this patter is random along the genome.

There is still some correlation along the sequence of divergence time, under a neutral coalescence model, but at least this correlation drops off rapidly with (recombination) distance and it is not correlated with other genomic features (except in the sense that the substitution rate depends on these features).

With selection working its magic on a genome scale, the patterns of sequence divergence gets a lot more interesting.

All of this is not really a new insight.  People working with e.g. Drosophila have known this for decades, but it has been ignored in more papers than I care to mention, and perhaps it is time we stop doing this.


McVicker, G., Gordon, D., Davis, C., & Green, P. (2009). Widespread Genomic Signatures of Natural Selection in Hominid Evolution PLoS Genetics, 5 (5) DOI: 10.1371/journal.pgen.1000471

132-145=-13

Numerical problems with hidden Markov models

Sunday, April 12th, 2009

Ok, so the problem I was working on a few days ago, where I needed interval arithmetic, concerns hidden Markov models.  We use these for genome analysis because they are pretty fast and thus scale to analysing whole genomes.  We still need to split the genomes into segments to parallelise our analysis, but still, it is pretty fast; at least a lot faster than MCMC methods or such.

Anyway, when we split the genomes into segments there is a tradeoff.  The larger segments, the more data we have for each analysis job, but of course each job also takes longer to run.

It is not the only trade off, though.  There is also a numerical issue: yes, the longer the segments the more observations we have, but the HMM algorithms have to do more arithmetic operations on the data, and numerical errors tend to accumulate with those.

I have no idea about how fast this happens, but I set out to find out, so I implemented the Forward algorithm with a parameterised type, so I could chuck in Boost’s interval type (with either floats, doubles or long doubles as the underlying type).

I use the “scaling” method for dealing with underflows (if you implement the Forward algorithm directly “from the book” you get an underflow from multiplying numbers between 0 and 1 pretty damn fast).  That means that in each column in the forward table, I calculate the column sum, call that sum the “scale” for that column, and then normalise all the values in the column so the column sums to 1.

As far as I understand, that approach is more numerical stable than working in log space, where you have to use various tricks to handle sums.  It is certainly a lot faster than translating numbers to and from log space all the time.

Anyway, I ran the forward algorithm on a 5 state HMM with an alphabet size of 4, with a random transition and emission matrix.  For each column in the table I can then look at the width of the intervals in the interval representation of my numbers.  The intervals are the intervals where the true value, had we infinite size floats to work with, would be.  The wider the intervals, the less certain we are about the true type if we just used plain floating point arithmetic.

First, I looked at the “scale” values.  The results are show below (notice that the y-axis is in log scale):

Initially the intervals are small, but then they just explode.  Not good for trusting the results; if the interval of possible values ranges over orders of magnitude, the true number can be just about anything! Depending on the underlying floating point type, the intervals explode early or late, but even for long doubles the explosion happens before 100 columns.  So the numerical accuracy can only be trusted for sequences shorter than 100!

Damn!

I also looked at the smallest and largest interval in each column.  The plot is below (shortest intervals in green, largest in red):

We see the explosion again, so it is pretty hard to distinguish between the smallest and largest intervals, so I have also plotted the interval widths for the prefixes before the explosion:

Before the explosion, it seems the nummerical error accumulates linearly (as we probably would expect).  There doesn’t seem to be much difference between the widest and the smallest intervals, though.

I expected something like the last plot for the entire range, I must admit.  Of course, I know that the accuracy is only good up to a point, because at some point you start adding very large numbers to very small (between 0 and 1) numbers, so after that point you just stick to the large number, but still…

I didn’t expect the explosion in inaccuracy that early.  Dealing with sequences of length less than 100 is just not feasible, and certainly not what anyone is doing.

Of course, the intervals here are worst case behaviour.  They are inclusive, so they capture all possible values that the true value could be in.  They say nothing about how close the values we get from working with finite size floating point values are to the true real value numbers.  That is probably what saves us in practise.

Still, I think I need to look into this in more details.  Who knows, there might even be a paper in this study somewhere…

102-127=-25