Ravioli, Spaghetti, Bonferroni, Cannelloni, Lasagna. All delicious products made from semola di grano duro. Add tomato sauce, put some grana on top, serve with Chianti. (Or Teroldego if, like me, you’ve spent some time in the Trentino.) Mjam!
While you’re chewing your dinner, you want to know if a particular jelly bean colour is associated with acne. There are 20 colours, so you start with purple, and find no association (p > 0.05). You go on to brown, pink, blue, teal, salmon, red, turquoise, magenta, yellow, grey, tan, and cyan, and all come back negative (p > 0.05). Then you come to green and indeed you find a significant link between green jelly beans and acne! Yay! Just to be sure, you test the remaining colours (mauve, beige, lilac, black, peach, and orange), and again find no association. Now you write up your results, announcing an association between green jelly beans and acne. Fame and fortune await.
This is the story behind this xkcd comic (“Significant”, no. 882), and if you didn’t get it and wondered why it’s funny, this post is for you.
To see why the significant result you found is not the evidence it seems to be, ask yourself this question: “If the null hypothesis of ‘no association’ between treatment and outcome is true, and if my significance level is α, how many independent tests do I have to make until I get a result that’s significant at level α, purely by chance?”
Let me rephrase that.
Ask yourself instead, “What does it mean when we say ‘this result is significant at the 5% level’?” It means that we will reject a null hypothesis that is in fact true in 5% of cases. And 5% is one in twenty. Just like in the comic. In other words, the association between green jelly beans and acne might be real, but if it’s not, one significant result out of 20 is exactly what we’d expect to happen at a 5% significance level, since that is what “5% significance level” means! So you can’t use this result as evidence for such an association.
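You can check this by simulation (a sketch of my own, not from the comic): under a true null hypothesis, p-values are uniformly distributed on [0, 1], so we can draw 20 of them per “study” and count how often at least one dips below 0.05.

```python
import random

random.seed(42)

n_tests = 20        # one test per jelly bean colour
alpha = 0.05
n_studies = 100_000

# Under the null hypothesis, each test's p-value is uniform on [0, 1].
false_alarms = sum(
    any(random.random() < alpha for _ in range(n_tests))
    for _ in range(n_studies)
)

print(false_alarms / n_studies)  # roughly 0.64, i.e. about 1 - 0.95**20
```

So even with no real effect anywhere, roughly two out of three such 20-colour studies will produce at least one “significant” finding.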
So what can you do?
This is the time to reveal that Bonferroni is not, in fact, a pasta dish. It is a way to modify the p-values or the significance levels for the individual tests when you later want to speak about the whole family of tests that you did.
Basically, the Bonferroni correction means that if you do n tests, and you want an overall significance of β, you need to perform each test at a significance of α = β/n. Alternatively, if you do all your tests at level α, then the overall significance will be β = nα. (If you want to be picky, α is really 1 − (1 − β)^(1/n), which for small β is close to β/n, and β = 1 − (1 − α)^n, which for small α is close to nα.)
In our example, α = 0.05 and n = 20, so β = 1 (in the picky version, it’s 0.64). In other words, the whole test is meaningless, something we’ve already found out above.
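The arithmetic is small enough to check directly; here is a quick sketch comparing the simple correction with the picky version:

```python
n = 20        # number of tests
alpha = 0.05  # per-test significance level

# Overall significance when each test runs at level alpha:
overall_simple = n * alpha             # the simple bound: 1.0
overall_exact = 1 - (1 - alpha) ** n   # the picky version: about 0.64

# Per-test level needed for an overall significance of beta:
beta = 0.05
per_test_simple = beta / n                   # 0.0025
per_test_exact = 1 - (1 - beta) ** (1 / n)   # about 0.00256, close to beta/n

print(overall_simple, overall_exact)
print(per_test_simple, per_test_exact)
```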
Caution is necessary when applying a Bonferroni correction, however. The overall null hypothesis in case of Bonferroni is always “all null hypotheses are true simultaneously”, which, when rejected, leaves “at least one null hypothesis is false”. Unfortunately, you won’t know which one. Also, while keeping the rate of type I errors (wrongly rejecting a true null hypothesis) at the overall significance level, the rate of type II errors (wrongly not rejecting a false null hypothesis) increases.
If you want to know which individual hypotheses to reject while keeping the overall rate at α, use the Holm-Bonferroni correction. It works as follows.
1. Perform your tests and compute the actual p-value for each (not just whether it is less than or greater than α).
2. Order your hypotheses in descending order of p-values, so that hypothesis n has the smallest p-value.
3. Set k ← n, the number of hypotheses.
4. If pₖ < α/k, reject hypothesis k, set k ← k − 1, and repeat this step. Otherwise, stop and don’t reject the remaining hypotheses.
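The steps above can be turned into a short function (a minimal sketch; the function name, interface, and the example p-values are my own):

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Indices of hypotheses rejected by the Holm-Bonferroni step-down
    procedure at overall significance level alpha."""
    # Visit hypotheses from smallest p-value to largest.
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    rejected = []
    k = len(p_values)
    for i in order:
        # The smallest p-value is tested against alpha/n, the next against
        # alpha/(n - 1), and so on; stop at the first test that fails.
        if p_values[i] < alpha / k:
            rejected.append(i)
            k -= 1
        else:
            break
    return rejected

# Jelly bean example: suppose green (index 0) came out at p = 0.002
# and the other 19 colours at p = 0.2 (made-up numbers).
print(holm_bonferroni([0.002] + [0.2] * 19))  # [0] -- only green is rejected
```

Green survives here because 0.002 < 0.05/20 = 0.0025; the next colour would need p < 0.05/19 ≈ 0.0026, which 0.2 fails, so the procedure stops.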
Let’s do our jelly bean example with the Holm-Bonferroni method. What would the p-value of the test with the green jelly beans have to be to reject the null hypothesis of “no association”? In step 4, we compare that p-value with α/k. Now α = 0.05 and k = 20, so p would have to be less than 0.05/20 = 1/400 = 0.0025 to reject this null hypothesis.
To be fair, correctly using corrections to p-values is still tricky, and the canonical paper about them is Perneger, “What’s wrong with Bonferroni adjustments”, British Medical Journal, 1998;316:1236–1238. A worthwhile read.