Simple data analysis in R
Friday, April 10th, 2009A few weeks ago Amir sent me a mail with some blog topic suggestions. These included "simple data analysis in R" and "computer science statistics". I'm not sure what the latter means. Maybe some computational approaches to statistics, like sampling techniques or such, but it could also just mean using statistics to figure out some computer science problems.
Anyway, this is probably a bit simpler than what he has in mind, but since it is what I am working on right this moment, I can use this opportunity to blog about it.
I'm still working on getting a feeling for numerical issues in my analysis, and since interval arithmetic failed me (for now), I'm just comparing results when adding numbers in floats, doubles or long doubles.
Playing well with R
I have a C++ program that adds the numbers and outputs the differences with the different precisions.
To make the analysis easy on myself, I always make sure that my program outputs data in a format that is easy to read into R. Sure, R has powerful IO functions, but I don't want to have to mess around with those when I can avoid it, so I always generate the data in the following format:
header1 header2 header3 ... headern val11 val12 val13 ... val1n val21 val22 val23 ... val2n ... ... ... ... ... valm1 valm2 valm3 ... valmn
where "headeri" is the name of a variable of interest and the following rows contain the various values of the variables (so they correspond to data points row wise).
In my C++ program, it looks like this:
std::cout << "N d.f l.d" << std::endl;
for (int N = start; N < stop; N += step) {
float float_x = add_random<float>(N, rng_seed);
double double_x = add_random<double>(N, rng_seed);
long double long_double_x = add_random<long double>(N, rng_seed);
std::cout << N << ' '
<< (double_x - float_x) << ' '
<< (long_double_x - double_x) << std::endl;
}
where I output the number of additions, the difference between double and float (the variable I call d.f) and the difference between long double and float (the variable I call l.d).
Just for completeness, so you can see what I am actually doing, the add_random function looks like this:
template <typename float_type>
float_type add_random(unsigned int N, unsigned int rng_seed)
{
boost::mt19937 rng(rng_seed);
boost::uniform_real<> uni_dist(0,1);
boost::variate_generator<boost::mt19937&, boost::uniform_real<> > random_01(rng, uni_dist);
float_type x = 0.0;
for (int i = 0; i < N; ++i) {
x += random_01();
}
return x;
}
Now, to read in data like this, I can use the function read.table() like this:
experiments <- read.table('experiments.txt', header=TRUE)
where the header=TRUE tells the function that the first line contains headers.
The data in experiments is now a "data frame" containing columns for each variable.
I can get the names of the variables as:
> names(experiments) [1] "N" "d.f" "l.d" >
or the first few rows as
> experiments[1:10,] N d.f l.d 1 1000000 -4.642890 0.00000e+00 2 2000000 2.903480 0.00000e+00 3 3000000 19.452800 0.00000e+00 4 4000000 0.134497 0.00000e+00 5 5000000 16.076500 3.93018e-07 6 6000000 194.319000 5.39236e-07 7 7000000 229.285000 7.47619e-07 8 8000000 232.006000 7.50180e-07 9 9000000 141.415000 9.69740e-07 10 10000000 101.108000 1.36020e-06
to check that it worked.
Analysing my data
Because my data is in a data frame, it is very easy to work with in R.
If I want to plot the difference between doubles and floats, for example, as a function of N, I simply use
> plot(d.f ~ N, experiments)
and get the following result:
Not the nicest plot ever, but good enough for now.
There is a strange bend in the plot around N=3e7. This might be the point where adding the current sum to a new number between 0 and 1 just returns the current sum (since the precision is too small to update the current sum).
To see what happens before that point, we can extract a subset of the data, say all the rows where N < 3e7, like this:
small <- experiments[experiments$N < 3e7,] plot(d.f ~ N, small)
that looks like this:
Weird, the double - float seems to drop before the point where the difference starts to grow linearly. I don't really know why, but in any case I am more interested in the absolute difference, and I can get that as
plot(abs(d.f) ~ N, small)
We can now try to figure out how the difference grows as a function of N.
The simplest thing we can think of is fitting a line. Yes, I know that we can see from the plot that a line is a bad fit, but let's try it anyway.
To do this, we construct a linear model of the data like this:
> m <- lm(abs(d.f) ~ N, small) > summary(m) Call: lm(formula = abs(d.f) ~ N, data = small) Residuals: Min 1Q Median 3Q Max -768.8 -415.2 112.3 369.4 716.5 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -5.884e+02 1.814e+02 -3.243 0.00314 ** N 7.949e-05 1.056e-05 7.525 4.28e-08 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 475.9 on 27 degrees of freedom Multiple R-squared: 0.6771, Adjusted R-squared: 0.6652 F-statistic: 56.63 on 1 and 27 DF, p-value: 4.282e-08
This tells us that the coefficient with this approximation to the data would be 7.9e-5.
We can plot the line on top of the data points like this
plot(abs(d.f) ~ N, small) abline(m)
and, not surprisingly, see that it doesn't fit particularly well.
Just for the fun of it, and to see if we can get a better model, we can split the data into N < 1.5e7 and N >= 1.5e7. That seems to be where the bend is.
> below.1.5e7 <- small[small$N < 1.5e7,] > above.1.5e7 <- small[small$N >= 1.5e7,] > > m.below <- lm(abs(d.f) ~ N, below.1.5e7) > summary(m.below) Call: lm(formula = abs(d.f) ~ N, data = below.1.5e7) Residuals: Min 1Q Median 3Q Max -92.86 -55.28 -34.67 37.42 142.14 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 3.858e+01 4.724e+01 0.817 0.430 N 6.938e-06 5.548e-06 1.251 0.235 Residual standard error: 83.68 on 12 degrees of freedom Multiple R-squared: 0.1153, Adjusted R-squared: 0.04156 F-statistic: 1.564 on 1 and 12 DF, p-value: 0.2350 > m.above <- lm(abs(d.f) ~ N, above.1.5e7) > summary(m.above) Call: lm(formula = abs(d.f) ~ N, data = above.1.5e7) Residuals: Min 1Q Median 3Q Max -337.94 -164.43 67.58 104.40 383.17 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -3.284e+03 3.072e+02 -10.69 8.26e-08 *** N 1.985e-04 1.370e-05 14.49 2.12e-09 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 229.3 on 13 degrees of freedom Multiple R-squared: 0.9417, Adjusted R-squared: 0.9372 F-statistic: 209.9 on 1 and 13 DF, p-value: 2.120e-09
The plot looks like this:
The two lines I highlighted above gives us the coefficients of the lines and the statistical significance of them.
For the points below 1.5e7, the coefficient is positive, but not significantly different from zero, so for that interval we would conclude that the numerical error from using floats instead of doubles is probably negligible. For the points above 1.5e7 the coefficient is higher than before and very significant.
Now let's take a look at the difference between double and long double:
plot(abs(l.d) ~ N, experiments)
For the first few values, it seems like there is no difference between double and long double
> experiments[experiments$l.d == 0,] N d.f l.d 1 1000000 -4.642890 0 2 2000000 2.903480 0 3 3000000 19.452800 0 4 4000000 0.134497 0
and after that the difference starts growing up to a point, after which it drops before it increases again.
We could try fitting lines to it again, say below and above 5e7:
> m.below.5e7 <- lm(abs(l.d) ~ N, experiments[experiments$N < 5e7,]) > summary(m.below.5e7) Call: lm(formula = abs(l.d) ~ N, data = experiments[experiments$N < 5e+07, ]) Residuals: Min 1Q Median 3Q Max -2.436e-06 -3.437e-07 -3.549e-08 3.586e-07 1.829e-06 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -1.061e-07 2.273e-07 -0.467 0.643 N 1.577e-13 7.914e-15 19.930 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 7.834e-07 on 47 degrees of freedom Multiple R-squared: 0.8942, Adjusted R-squared: 0.8919 F-statistic: 397.2 on 1 and 47 DF, p-value: < 2.2e-16 > m.above.5e7 <- lm(abs(l.d) ~ N, experiments[experiments$N >= 5e7,]) > summary(m.above.5e7) Call: lm(formula = abs(l.d) ~ N, data = experiments[experiments$N >= 5e+07, ]) Residuals: Min 1Q Median 3Q Max -5.862e-06 -2.507e-06 4.010e-07 2.080e-06 7.596e-06 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1.053e-05 2.432e-06 4.329 7.56e-05 *** N -6.480e-14 3.205e-14 -2.022 0.0488 * --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 3.271e-06 on 48 degrees of freedom Multiple R-squared: 0.07846, Adjusted R-squared: 0.05926 F-statistic: 4.087 on 1 and 48 DF, p-value: 0.04882
In both cases we get significant coefficients, but for the latter I simply refuse to believe that the difference between long doubles and doubles decreases when I add more and more numbers.
Always remember that significance only means that the data doesn't fit your null model, it doesn't mean that the alternative model fits the data!
It is pretty clear here that a line is a poor fit for the upper half of the data, but in general you can always check (and always should check) a linear model by plotting it:
opar <- par(mfrow=c(2,2)) plot(m.above.5e7) par(opar)
In this case, the top left plot is screaming out that there is something wrong, since the difference between residuals and fitted values are greater at the left and right than they are in the middle (with a proper fit, they should be normally distributed around 0).
With this data, I would probably just assume that above 5e7 there is no obvious change in the difference between using long double and double within the range of N values I have tested. Clearly there is a relationship between N and the difference I see, but it is not an obvious increase and a decrease just does not match my prior belief, so I stick to that conclusion in spite of the evidence. At least until I can come up with a better fitting model.
Ok, that is an example of how I would use R to analyse some data. Hope you enjoyed it, Amir.
--
100-123=-23






