Posts Tagged ‘data analysis’

Simple data analysis in R

Friday, April 10th, 2009

A few weeks ago Amir sent me a mail with some blog topic suggestions.  These included "simple data analysis in R" and "computer science statistics".  I'm not sure what the latter means. Maybe some computational approaches to statistics, like sampling techniques or such, but it could also just mean using statistics to figure out some computer science problems.

Anyway, this is probably a bit simpler than what he has in mind, but since it is what I am working on right this moment, I can use this opportunity to blog about it.

I'm still working on getting a feeling for numerical issues in my analysis, and since interval arithmetic failed me (for now), I'm just comparing results when adding numbers in floats, doubles or long doubles.

Playing well with R

I have a C++ program that adds the numbers and outputs the differences with the different precisions.

To make the analysis easy on myself, I always make sure that my program outputs data in a format that is easy to read into R.  Sure, R has powerful IO functions, but I don't want to have to mess around with those when I can avoid it, so I always generate the data in the following format:

header1 header2 header3 ... headern
val11   val12   val13   ... val1n
val21   val22   val23   ... val2n
...     ...     ...     ... ...
valm1   valm2   valm3   ... valmn

where "headeri" is the name of a variable of interest and the following rows contain the various values of the variables (so they correspond to data points row wise).

In my C++ program, it looks like this:

	std::cout << "N d.f l.d" << std::endl;
	for (int N = start; N < stop; N += step) {
		float float_x = add_random<float>(N, rng_seed);
		double double_x = add_random<double>(N, rng_seed);
		long double long_double_x = add_random<long double>(N, rng_seed);

		std::cout << N << ' '
		          << (double_x - float_x) << ' '
		          << (long_double_x - double_x) << std::endl;
	}

where I output the number of additions, the difference between double and float (the variable I call d.f) and the difference between long double and float (the variable I call l.d).

Just for completeness, so you can see what I am actually doing, the add_random function looks like this:

template <typename float_type>
float_type add_random(unsigned int N, unsigned int rng_seed)
{
    boost::mt19937 rng(rng_seed);
    boost::uniform_real<> uni_dist(0,1);
    boost::variate_generator<boost::mt19937&, boost::uniform_real<> > random_01(rng, uni_dist);

    float_type x = 0.0;
    for (int i = 0; i < N; ++i) {
        x += random_01();
    }
    return x;
}

Now, to read in data like this, I can use the function read.table() like this:

experiments <- read.table('experiments.txt', header=TRUE)

where the header=TRUE tells the function that the first line contains headers.

The data in experiments is now a "data frame" containing columns for each variable.

I can get the names of the variables as:

> names(experiments)
[1] "N"   "d.f" "l.d"
>

or the first few rows as

> experiments[1:10,]
          N        d.f         l.d
1   1000000  -4.642890 0.00000e+00
2   2000000   2.903480 0.00000e+00
3   3000000  19.452800 0.00000e+00
4   4000000   0.134497 0.00000e+00
5   5000000  16.076500 3.93018e-07
6   6000000 194.319000 5.39236e-07
7   7000000 229.285000 7.47619e-07
8   8000000 232.006000 7.50180e-07
9   9000000 141.415000 9.69740e-07
10 10000000 101.108000 1.36020e-06

to check that it worked.

Analysing my data

Because my data is in a data frame, it is very easy to work with in R.

If I want to plot the difference between doubles and floats, for example, as a function of N, I simply use

> plot(d.f ~ N, experiments)

and get the following result:

Not the nicest plot ever, but good enough for now.

There is a strange bend in the plot around N=3e7.  This might be the point where adding the current sum to a new number between 0 and 1 just returns the current sum (since the precision is too small to update the current sum).

To see what happens before that point, we can extract a subset of the data, say all the rows where N < 3e7, like this:

small <- experiments[experiments$N < 3e7,]
plot(d.f ~ N, small)

that looks like this:

Weird, the double - float seems to drop before the point where the difference starts to grow linearly.  I don't really know why, but in any case I am more interested in the absolute difference, and I can get that as

plot(abs(d.f) ~ N, small)

We can now try to figure out how the difference grows as a function of N.

The simplest thing we can think of is fitting a line.  Yes, I know that we can see from the plot that a line is a bad fit, but let's try it anyway.

To do this, we construct a linear model of the data like this:

> m <- lm(abs(d.f) ~ N, small)
> summary(m)

Call:
lm(formula = abs(d.f) ~ N, data = small)

Residuals:
   Min     1Q Median     3Q    Max
-768.8 -415.2  112.3  369.4  716.5 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)   
(Intercept) -5.884e+02  1.814e+02  -3.243  0.00314 **
N            7.949e-05  1.056e-05   7.525 4.28e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 475.9 on 27 degrees of freedom
Multiple R-squared: 0.6771,    Adjusted R-squared: 0.6652
F-statistic: 56.63 on 1 and 27 DF,  p-value: 4.282e-08

This tells us that the coefficient with this approximation to the data would be 7.9e-5.

We can plot the line on top of the data points like this

plot(abs(d.f) ~ N, small)
abline(m)

and, not surprisingly, see that it doesn't fit particularly well.

Just for the fun of it, and to see if we can get a better model, we can split the data into N < 1.5e7 and N >= 1.5e7.  That seems to be where the bend is.

> below.1.5e7 <- small[small$N < 1.5e7,]
> above.1.5e7 <- small[small$N >= 1.5e7,]
>
> m.below <- lm(abs(d.f) ~ N, below.1.5e7)
> summary(m.below)

Call:
lm(formula = abs(d.f) ~ N, data = below.1.5e7)

Residuals:
   Min     1Q Median     3Q    Max
-92.86 -55.28 -34.67  37.42 142.14 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.858e+01  4.724e+01   0.817    0.430
N           6.938e-06  5.548e-06   1.251    0.235

Residual standard error: 83.68 on 12 degrees of freedom
Multiple R-squared: 0.1153,    Adjusted R-squared: 0.04156
F-statistic: 1.564 on 1 and 12 DF,  p-value: 0.2350 

> m.above <- lm(abs(d.f) ~ N, above.1.5e7)
> summary(m.above)

Call:
lm(formula = abs(d.f) ~ N, data = above.1.5e7)

Residuals:
    Min      1Q  Median      3Q     Max
-337.94 -164.43   67.58  104.40  383.17 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)   
(Intercept) -3.284e+03  3.072e+02  -10.69 8.26e-08 ***
N            1.985e-04  1.370e-05   14.49 2.12e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 229.3 on 13 degrees of freedom
Multiple R-squared: 0.9417,    Adjusted R-squared: 0.9372
F-statistic: 209.9 on 1 and 13 DF,  p-value: 2.120e-09

The plot looks like this:

The two lines I highlighted above gives us the coefficients of the lines and the statistical significance of them.

For the points below 1.5e7, the coefficient is positive, but not significantly different from zero, so for that interval we would conclude that the numerical error from using floats instead of doubles is probably negligible.  For the points above 1.5e7 the coefficient is higher than before and very significant.

Now let's take a look at the difference between double and long double:

plot(abs(l.d) ~ N, experiments)

For the first few values, it seems like there is no difference between double and long double

> experiments[experiments$l.d == 0,]
        N       d.f l.d
1 1000000 -4.642890   0
2 2000000  2.903480   0
3 3000000 19.452800   0
4 4000000  0.134497   0

and after that the difference starts growing up to a point, after which it drops before it increases again.

We could try fitting lines to it again, say below and above 5e7:

> m.below.5e7 <- lm(abs(l.d) ~ N, experiments[experiments$N < 5e7,])
> summary(m.below.5e7)

Call:
lm(formula = abs(l.d) ~ N, data = experiments[experiments$N <
    5e+07, ])

Residuals:
       Min         1Q     Median         3Q        Max
-2.436e-06 -3.437e-07 -3.549e-08  3.586e-07  1.829e-06 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)   
(Intercept) -1.061e-07  2.273e-07  -0.467    0.643   
N            1.577e-13  7.914e-15  19.930   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 7.834e-07 on 47 degrees of freedom
Multiple R-squared: 0.8942,    Adjusted R-squared: 0.8919
F-statistic: 397.2 on 1 and 47 DF,  p-value: < 2.2e-16 

> m.above.5e7 <- lm(abs(l.d) ~ N, experiments[experiments$N >= 5e7,])
> summary(m.above.5e7)

Call:
lm(formula = abs(l.d) ~ N, data = experiments[experiments$N >=
    5e+07, ])

Residuals:
       Min         1Q     Median         3Q        Max
-5.862e-06 -2.507e-06  4.010e-07  2.080e-06  7.596e-06 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)   
(Intercept)  1.053e-05  2.432e-06   4.329 7.56e-05 ***
N           -6.480e-14  3.205e-14  -2.022   0.0488 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 3.271e-06 on 48 degrees of freedom
Multiple R-squared: 0.07846,    Adjusted R-squared: 0.05926
F-statistic: 4.087 on 1 and 48 DF,  p-value: 0.04882

In both cases we get significant coefficients, but for the latter I simply refuse to believe that the difference between long doubles and doubles decreases when I add more and more numbers.

Always remember that significance only means that the data doesn't fit your null model, it doesn't mean that the alternative model fits the data!

It is pretty clear here that a line is a poor fit for the upper half of the data, but in general you can always check (and always should check) a linear model by plotting it:

opar <- par(mfrow=c(2,2))
plot(m.above.5e7)
par(opar)

In this case, the top left plot is screaming out that there is something wrong, since the difference between residuals and fitted values are greater at the left and right than they are in the middle (with a proper fit, they should be normally distributed around 0).

With this data, I would probably just assume that above 5e7 there is no obvious change in the difference between using long double and double within the range of N values I have tested.  Clearly there is a relationship between N and the difference I see, but it is not an obvious increase and a decrease just does not match my prior belief, so I stick to that conclusion in spite of the evidence.  At least until I can come up with a better fitting model.

Ok, that is an example of how I would use R to analyse some data.  Hope you enjoyed it, Amir.

--

100-123=-23