Stats Tip #2: Don’t Bin (If You Don’t Have To)!

In a paper published at a peer-reviewed conference, the authors wished to test the hypothesis that their univariate data were distributed according to an exponential distribution. They didn't get it quite right.

A continuous random variable X ≥ 0 is said to be exponentially distributed with parameter λ if P(X ≤ x) = 1 − exp(−λx). The inter-arrival times of a Poisson process (in which events occur at a fixed rate, independently of each other) are exponentially distributed.
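A quick way to see this connection in R (a minimal sketch; the rate of 0.5 events per day is arbitrary):

x <- rexp(1000, rate = 0.5) # simulate 1000 inter-arrival times of a Poisson process
mean(x)                     # should be close to 1/rate = 2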

Their data consisted of the times between the dates on which vulnerability reports were published, so the hypothesis of an exponential distribution wasn't unreasonable. Their procedure was essentially this:

  1. For each of four unrelated (but large) bin sizes a < b < c < d, chosen for no good theoretical reason:
  2. Bin the arrival times into bins of that size.
  3. Estimate the exponential parameter λ and compute the expected number of events in each bin if λ were the true parameter.
  4. Perform a Chi-square test on the binned data.
  5. Report the p-value.
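In R, one iteration of this procedure might look roughly like the following sketch (x is the vector of inter-arrival times in days; the 30-day bin size is purely illustrative, not one of the authors' actual choices):

binsize <- 30                                    # illustrative bin size
rate <- 1/mean(x)                                # estimate the exponential parameter
breaks <- seq(0, max(x) + binsize, by = binsize) # bin edges
observed <- table(cut(x, breaks))                # observed counts per bin
p <- diff(pexp(breaks, rate = rate))             # expected probability of each bin
chisq.test(observed, p = p / sum(p))             # Chi-square test on the binned data

(Note that chisq.test does not reduce the degrees of freedom to account for the estimated parameter.)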

Sounds OK? Well, not quite.

The problem is that the original data is essentially continuous (or else distributed into so many bins, namely days, that it is continuous for all practical purposes). Binning into larger units loses information, and it might just lose the kind of local variation that would kill the hypothesis of an exponential distribution on the unbinned data.

As an illustration, we plot a histogram of SAT results with different bin sizes. With a comfortably large bin size of 41, we get a nice plot with a single mode:

[Figure: With a large bin size of 41, a nice distribution emerges.]

With a smaller bin size of 25, however, this mode vanishes and the interpretation is much less clear:

[Figure: With a smaller bin size of 25, the nice properties disappear.]
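Plots like these are easy to produce; a minimal sketch in R, assuming a numeric vector sat of SAT scores (the actual figures were made with the script linked at the end of this post):

hist(sat, breaks = seq(min(sat), max(sat) + 41, by = 41),
     main = "Bin size 41", xlab = "SAT score")
hist(sat, breaks = seq(min(sat), max(sat) + 25, by = 25),
     main = "Bin size 25", xlab = "SAT score")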

In this case, the problem is especially egregious, since there is not even a practical need to bin. It is perfectly easy to check the goodness of fit of the data to the exponential distribution using the Kolmogorov-Smirnov test. If x is a vector of inter-arrival times, doing this in R is a three-liner:

rate <- 1/mean(x)                       # estimate the exponential rate parameter
cdf <- function(q) pexp(q, rate = rate) # fitted cumulative distribution function
ks.test(x, cdf)                         # goodness-of-fit test

Of course, if I have a Poisson process, then counting the number of events (rather than the inter-arrival times) in bins of any fixed size will give data that is Poisson distributed. So doing the same testing exercise on the number of events per bin (with a few bin sizes, some large, some small) and then checking for a Poisson (not exponential) distribution would actually have been a valid way to test this. Indeed, the authors of the paper mentioned above did this as well.
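A rough sketch of this alternative in R (with x as above, so the event dates are its cumulative sums; the 30-day bin size is again just illustrative):

dates <- cumsum(x)                                  # event dates from inter-arrival times
binsize <- 30
counts <- as.vector(table(cut(dates, seq(0, max(dates) + binsize, by = binsize))))
lambda <- mean(counts)                              # estimated Poisson parameter
k <- 0:max(counts)
observed <- tabulate(counts + 1, nbins = length(k)) # frequency of each count value
p <- dpois(k, lambda)
p[length(p)] <- p[length(p)] + ppois(max(counts), lambda, lower.tail = FALSE) # fold the tail into the last cell
chisq.test(observed, p = p)                         # Poisson goodness-of-fit test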

So surely, now all is well, right? Well, not quite.

Since we are testing several hypotheses simultaneously here, we would have to apply an appropriate correction (this will be the topic of another stats tip). This, as should come as no surprise at this point, the authors of the paper did not do. And even had they wanted to, they could not have, because of a flaw in their experimental design: they never computed the p-values directly, but instead simply compared the Chi-square statistics with the critical values for p = 0.01.

Where there are no p-values, there can be no correction to p-values!
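For the record, had the p-values been available, correcting them in R would have been a one-liner (pvals being the hypothetical vector of the four p-values):

p.adjust(pvals, method = "bonferroni") # or method = "holm", which is never worse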

Download the script that was used to generate the above figures.

About Stephan Neuhaus

Stephan Neuhaus has been working in security since 1992, when he was a member of the PGP 2.0 development team. He then became a successful entrepreneur before going back to university, where he received his PhD in Software Engineering from Saarbrücken University in 2008. He is now a Senior Researcher at ETH Zurich, where he works on empirical software security in Prof. Plattner's Communication Systems Group.