Stats Tip #1: Plot The Data!

I recently saw a paper, published at a peer-reviewed conference, in which the authors argue that some inter-arrival times were exponentially distributed and that the number of events per unit time was Poisson distributed. They ran some statistical tests and concluded that there was not enough evidence to reject either hypothesis.

Sounds OK? Well, not quite.

Leaving aside that their findings directly contradict mine (my tests, admittedly on related but not identical data, show that the inter-arrival times are definitely not exponentially distributed), they committed a huge sin against clarity by not plotting their raw data. Modern packages like R or SPSS make this trivial even for large datasets and, if tuned properly, can produce production-quality plots, so there is no excuse for not showing the reader your data.

Plotting the data has a number of advantages:

It shows outliers that would otherwise confound parameter estimates. In this case, the outlier seems to be a simple data entry error, since all other values are close to 100 but this one is close to 1000:

Plotting the raw data will show outliers, in this case probably a data entry error.
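As a rough illustration of how much one such value can distort the estimates, here is a minimal sketch in Python using only the standard library. It is not the script behind the plots on this page; the sample values, the cutoff of 500, and the helper names `summarize` and `text_strip_plot` are made up for the example.

```python
import statistics

# Hypothetical sample: values near 100 plus one suspected data entry error.
values = [98.2, 101.5, 99.7, 100.9, 97.8, 102.3, 100.1, 99.0, 1003.0, 100.6]

def summarize(xs):
    """Return (mean, sample standard deviation, median) of xs."""
    return statistics.mean(xs), statistics.stdev(xs), statistics.median(xs)

def text_strip_plot(xs, width=60):
    """Draw a crude one-line plot: each value becomes a '*' on a scaled axis."""
    lo, hi = min(xs), max(xs)
    cells = [" "] * width
    for x in xs:
        pos = int((x - lo) / (hi - lo) * (width - 1)) if hi > lo else 0
        cells[pos] = "*"
    return "".join(cells)

with_outlier = summarize(values)
# The cutoff of 500 is an arbitrary choice for this made-up sample.
without_outlier = summarize([v for v in values if v < 500])

print(text_strip_plot(values))
print("with outlier:    mean=%.1f sd=%.1f median=%.1f" % with_outlier)
print("without outlier: mean=%.1f sd=%.1f median=%.1f" % without_outlier)
```

Even the crude one-line plot puts the stray value far away from everything else, which is the kind of thing a table of means and standard deviations will not tell you.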

It shows nuances that aggregate statistics remove. This density plot, for example, shows a distribution with two modes, something you will never learn if you look only at means and standard deviations:

Plotting the raw data will reveal that the data is bimodal. Aggregate values like the mean would not.
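The same point can be made with numbers. The following sketch (Python standard library only, with made-up data and a hypothetical `histogram` helper) builds a two-mode sample and a one-mode sample with roughly the same mean and standard deviation, then prints a text histogram of each. The seed and the distribution parameters are assumptions chosen for the illustration.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical data: a two-mode sample and a one-mode sample built to
# have roughly the same mean and standard deviation.
bimodal = [rng.gauss(-2.0, 0.5) for _ in range(500)] + \
          [rng.gauss(2.0, 0.5) for _ in range(500)]
unimodal = [rng.gauss(0.0, 2.06) for _ in range(1000)]

def histogram(xs, lo=-6.0, hi=6.0, bins=12):
    """Count how many values fall into each of `bins` equal-width bins."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for x in xs:
        if lo <= x < hi:
            counts[int((x - lo) / width)] += 1
    return counts

def show(name, xs):
    """Print the aggregate statistics, then one histogram row per bin."""
    print("%s: mean=%.2f sd=%.2f" % (name, statistics.mean(xs), statistics.stdev(xs)))
    for i, count in enumerate(histogram(xs)):
        print("%5.1f | %s" % (-6.0 + i, "#" * (count // 10)))

show("bimodal", bimodal)
show("unimodal", unimodal)
```

The two summary lines should look nearly alike, while the histograms underneath them do not.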

It allows readers to arrive at their own conclusions about your claim. For example, if you report only the trend for this data, your readers will be unaware of the seasonal variation in it:

What do the seasonal variations mean? If you only give the trend, readers cannot form their own opinions.
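To put a number on what a trend-only report leaves out, here is a small sketch in Python (standard library only). The series is synthetic, a straight line plus a yearly sine wave, and `fit_line` is a hypothetical helper doing an ordinary least squares fit; none of it comes from the data plotted above.

```python
import math

# Hypothetical monthly series: a linear trend plus a yearly cycle.
months = list(range(48))
series = [100.0 + 0.5 * t + 10.0 * math.sin(2 * math.pi * t / 12) for t in months]

def fit_line(xs, ys):
    """Ordinary least squares fit of ys = intercept + slope * xs."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope

intercept, slope = fit_line(months, series)

# Everything the fitted trend does not explain ends up in the residuals.
residuals = [y - (intercept + slope * t) for t, y in zip(months, series)]

print("reported trend: %.2f per month" % slope)
print("left out: swings from %.1f to %.1f around that trend"
      % (min(residuals), max(residuals)))
```

A reader who is given only the slope has no way to know that the residuals are anything but noise.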

It keeps the authors honest and shows you when conclusions are obviously wrong, such as when you draw a Q-Q plot of a distribution that is not normal, no matter how normal you want it to be:

This is not a normal distribution, although you might want it to be. Plotting the data will keep you honest.
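For readers who have not built a Q-Q plot by hand, the points behind one can be computed as in the sketch below (Python standard library, version 3.8 or later for `statistics.NormalDist`). The plotting positions `(i + 0.5) / n` are one common convention among several, and the samples, the seed, and the helper names `qq_points` and `max_deviation` are assumptions for the illustration.

```python
import random
import statistics

def qq_points(sample):
    """Pair each sorted, standardized observation with the matching
    standard normal quantile, using plotting positions (i + 0.5) / n."""
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)
    std_normal = statistics.NormalDist()
    ordered = sorted(sample)
    return [
        (std_normal.inv_cdf((i + 0.5) / n), (x - mean) / sd)
        for i, x in enumerate(ordered)
    ]

def max_deviation(points):
    """Largest vertical distance of any point from the line y = x."""
    return max(abs(observed - theoretical) for theoretical, observed in points)

rng = random.Random(1)  # fixed seed so the sketch is repeatable
normal_sample = [rng.gauss(10.0, 2.0) for _ in range(500)]
skewed_sample = [rng.expovariate(1.0) for _ in range(500)]  # clearly not normal

print("normal sample: max deviation %.2f" % max_deviation(qq_points(normal_sample)))
print("skewed sample: max deviation %.2f" % max_deviation(qq_points(skewed_sample)))
```

Handing the pairs from `qq_points` to any plotting package gives the Q-Q plot itself: points from a normal sample hug the diagonal, points from the skewed sample bend away from it at the tails.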

I particularly like the free program R and the ggplot2 package that is available for it.  Now, R has a steep learning curve and ggplot2 often frustrates to the point that you want to bang your head against something, but when you finally get them to do what you want, the results are beautiful. I urge you to invest the time to become familiar with at least one good plotting package.

A related point concerns the correct output format for plots, and indeed for all kinds of graphics. Please, please use a vector format like SVG, PostScript, or PDF. Do not use pixel formats like PNM, PNG, GIF, TIFF, or JPEG as the native output format for your plots. They will not scale. I repeat: they will not scale. There will come a day when you are asked to make a poster with your data, and you will be magnifying your plots far beyond their originally intended sizes. Vector graphics look crisp at any magnification; pixel graphics do not. For example, the plots on this page were produced with an SVG package for R and only afterwards converted to PNG for inline embedding.

Chances are that you’re going to use pdflatex, so I recommend PDF and \usepackage{graphicx}. If necessary, embed fonts to avoid surprises. (I once had all my math fonts replaced by Courier. Not nice.)
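To see why vector formats scale, it helps to look at what such a file contains. The sketch below (Python standard library only) writes a tiny scatter plot as SVG by hand; the function name `scatter_svg`, the canvas size, and the data points are all hypothetical, and a real plotting package would of course do this for you.

```python
def scatter_svg(points, width=400, height=300, margin=20):
    """Return an SVG document (as a string) with one circle per (x, y) point."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_span = (max(xs) - min(xs)) or 1.0
    y_span = (max(ys) - min(ys)) or 1.0
    circles = []
    for x, y in points:
        cx = margin + (x - min(xs)) / x_span * (width - 2 * margin)
        # SVG's y axis points down, so flip it.
        cy = height - margin - (y - min(ys)) / y_span * (height - 2 * margin)
        circles.append('<circle cx="%.1f" cy="%.1f" r="3"/>' % (cx, cy))
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 %d %d">\n%s\n</svg>\n'
        % (width, height, "\n".join(circles))
    )

points = [(1, 2.0), (2, 4.1), (3, 5.9), (4, 8.2), (5, 9.8)]  # hypothetical data
svg_text = scatter_svg(points)
print(svg_text)
```

The output is a list of shapes and coordinates rather than a grid of pixels, so a viewer can redraw it at poster size without losing anything. Saving `svg_text` to a file ending in `.svg` and opening it in a browser is enough to try this out.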

Download the script used to make the plots on this page.

About Stephan Neuhaus

Stephan Neuhaus has been working in security since 1992, when he was a member of the PGP 2.0 development team. He was then a successful entrepreneur before going back to university, where he received his PhD in Software Engineering from Saarbrücken University in 2008. He is now a Senior Researcher at ETH Zurich, where he works on empirical software security in Prof. Plattner's Communication Systems Group.