Stats Tip #4: Make Use of Nonparametric Tests

In statistical hypothesis testing, you often have the choice between tests that assume a certain distribution of the underlying data and tests that don’t make such assumptions. For example, when evaluating a drug trial, you can choose between, for example, the t-test, which assumes that the experimental data is normally distributed, and the Mann-Whitney U test, also known as the Wilcoxon rank-sum test, which makes no assumption about that distribution. As Rebecca Black might say, “which test should you take?”

The t-test is called a parametric test, since it assumes that the data in question follows a specific, known distribution under the null hypothesis (perhaps with unknown distribution parameters), in this case the normal distribution. Unfortunately, there is rarely, if ever, a good theoretical reason for that assumption. Usually nothing is known about the experimental error and its distribution, let alone about the true population values.

The rank-sum test is called a nonparametric test, since it makes no such assumption about the distribution of the data. Generally, nonparametric tests make fewer assumptions than parametric tests.

Now it is true that if you know the distribution of your data, parametric tests will give you tighter bounds and better estimates. But if the assumptions turn out not to be true, the results can be misleading or outright wrong. And as I have remarked above, it is rare to know the distribution of your data on theoretical grounds.

As David Colquhoun has remarked in his excellent statistics textbook (which, by the way, is available for download, and which is so full of good common sense that I have yet to see a better one), statistical tests exist to prevent you from making a fool of yourself. The fewer assumptions you make, the smaller the chance of fooling yourself and your readers. Hence I recommend that when in doubt, you use nonparametric tests. The nature of things will ensure that you are almost always in doubt.

Of course, as Colquhoun also observes, “if the distribution is known (not assumed but known), then use the appropriate parametric test”. If the distribution is not known, then sometimes, with small samples, a real effect might not give a statistically significant result, but that merely means that “it is a disadvantage not to know the distribution” and that this “does not constitute a disadvantage of nonparametric tests”.

Additionally, nonparametric tests can often be nearly as powerful as parametric tests: even when the data really is normal, so that the t-test is at its best, the asymptotic relative efficiency of the rank-sum test compared to the t-test is about 0.95. Let’s try an example. Assume that we are testing a blood-pressure-reducing medication, and that we have 20 people, randomly assigned to a treatment group and a control group. The treatment group gets the drug and the control group gets a placebo. After two weeks, the systolic blood pressure is measured in those 20 patients, and this is the result:

Control Group:

Patient:     1   2   3   4   5   6   7   8   9  10
BP (mmHg): 145 152 134 121 147 171 133 148 162 155

Treatment Group:

Patient:     1   2   3   4   5   6   7   8   9  10
BP (mmHg): 137 144 126 113 139 163 125 140 154 147

If you look at the table closely, you will see that I have cheated: I have simply subtracted 8 from each of the control group’s values to get the treatment group’s values. This means that, because of the way the data sets are constructed, there is a real difference in means, namely exactly 8. Let’s see whether the tests find it.
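In R terms, the treatment data is just the control data shifted down by 8 (the same vectors appear again in the test session below):

> bp.control <- c(145, 152, 134, 121, 147, 171, 133, 148, 162, 155)
> bp.treatment <- bp.control - 8
> bp.treatment
 [1] 137 144 126 113 139 163 125 140 154 147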

Is our data normally distributed? Here is a Q-Q plot of the control group data, which doesn’t look so bad:

[Figure: Q-Q plot for blood pressure data]
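If you want to reproduce such a plot yourself, base R does it in two lines:

> qqnorm(bp.control, main = "Q-Q plot for blood pressure data")
> qqline(bp.control)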

A Shapiro-Wilk normality test yields a p-value of 0.98, so the data is as nicely normal as a data set consisting of only ten values can be.
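In R, that check is a one-liner:

> shapiro.test(bp.control)

Now our null hypothesis is that the treatment had no effect. And we are in luck, because (1) our data is normally distributed, (2) the two samples are independent (one presumes), (3) the sample sizes are equal, and (4) the variances in the data are equal. We know that last one because of the way I cheated with the data sets. Therefore, we can test using R: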


> bp.control <- c(145, 152, 134, 121, 147, 171, 133, 148, 162, 155)
> bp.treatment <- c(137, 144, 126, 113, 139, 163, 125, 140, 154, 147)
> t.test(bp.control, bp.treatment, var.equal = TRUE, alternative = "two.sided")

Two Sample t-test

data: bp.control and bp.treatment
t = 1.2198, df = 18, p-value = 0.2383
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-5.778803 21.778803
sample estimates:
mean of x mean of y
146.8 138.8

Holy smoke! Even with all these nice assumptions, we only get a p-value of 24%! The small data set and the large variation within it have completely masked the small but real reduction.
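Just how badly masked? A quick power calculation (plugging in the true difference of 8 that we built into the data, and the control group’s sample standard deviation of roughly 14.7) suggests that a t-test with ten patients per group has only about a 20% chance of detecting an effect of this size at the 5% level:

> power.t.test(n = 10, delta = 8, sd = sd(bp.control), sig.level = 0.05)

Let’s try the Wilcoxon rank-sum test, then.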


> wilcox.test(bp.control, bp.treatment, exact=FALSE, alternative="two.sided")

Wilcoxon rank sum test with continuity correction

data: bp.control and bp.treatment
W = 65.5, p-value = 0.2567
alternative hypothesis: true location shift is not equal to 0

Why does R call the Wilcoxon test wilcox.test and not wilcoxon.test? No idea. Anyway, the nonparametric test, which makes no assumptions at all about the underlying distribution, comes to essentially the same conclusion as the t-test, which makes quite strong ones.
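Incidentally, if you also want a nonparametric estimate of the size of the shift, wilcox.test will compute one (the Hodges-Lehmann estimator), together with a confidence interval, if you pass conf.int = TRUE:

> wilcox.test(bp.control, bp.treatment, exact = FALSE, conf.int = TRUE)

That is the nonparametric counterpart of the confidence interval in the t-test output above.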

About Stephan Neuhaus

Stephan Neuhaus has been working in security since 1992, when he was a member of the PGP 2.0 development team. He was subsequently a successful entrepreneur before going back to university, where he received his PhD in Software Engineering from Saarbrücken University in 2008. He is now a Senior Researcher at ETH Zurich, where he works on empirical software security in Prof. Plattner's Communication Systems Group.