Show me the data

Visualizing Data per group or condition


Nothing seems simpler than showing people what data look like. Yet in many cases we use ugly graphs that do not show the data; instead we show summary statistics and other useless or non-intuitive measures (yes, I'm talking about standard errors and 95% confidence intervals; see Morey et al., 2016).

I'm as guilty as everyone else. Recently, many psychologists, brain imagers and statisticians have been working out better ways to present data, and violin plots seem to be the right way to go.

What do you need to show


For starters, you need to show the data, i.e. scatter plots. Then you want to check whether you have outliers. Finally, you need to show how the data are distributed. There are many ways to do that, such as histograms, but here we want to show the distribution on top of the scatter plots, and for that we can use kernel density estimates (KDE). The advantage of a KDE is that it shows the 'population' the data come from (well, it is supposed to).

Now that we show the data, we need to show a summary statistic. By default, most people use the mean, but if the distribution is skewed that is not necessarily a good option. Either way, once you have an estimator, you of course need to display its variability.

My solution


I wrote some Matlab code to do just that. In the following I explain what the defaults are, and why I think these are the best choices.


Outlier detection

This is a pretty complicated field, and each method has pros and cons. Most often there is a balance to find between false positives and false negatives (see for instance some simulations I have done here). Here I choose to use S-outliers (Rousseeuw & Croux, 1993), which are based on the distances between each pair of data points and have the advantage of not assuming that the distribution is symmetric.
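To make the idea concrete, here is a minimal sketch in Python (not my Matlab code). For each point it takes the median distance to all other points, scales it by the Sn estimator of Rousseeuw & Croux (with their consistency factor 1.1926), and flags points whose scaled distance is large. The function name `s_outliers`, the use of plain medians instead of the low/high medians of the original paper, and the cut-off of 2.2414 (the square root of the 0.975 quantile of a chi-square with 1 degree of freedom) are assumptions made for this illustration.

```python
from statistics import median


def s_outliers(x, threshold=2.2414):
    """Flag S-outliers in a list of numbers (illustrative sketch).

    threshold is an assumed cut-off: sqrt of the 0.975 chi-square quantile, 1 df.
    """
    # For each point, the median absolute distance to every other point.
    dist = [
        median(abs(xi - xj) for j, xj in enumerate(x) if j != i)
        for i, xi in enumerate(x)
    ]
    # Sn scale estimate: consistency factor times the median of those medians.
    sn = 1.1926 * median(dist)
    if sn == 0:
        # No spread at all (many ties): nothing can be flagged sensibly.
        return [False] * len(x)
    return [d / sn > threshold for d in dist]


values = [2.1, 2.4, 2.2, 2.6, 2.3, 2.5, 9.0]
flags = s_outliers(values)  # only the last value (9.0) is flagged
```

Because everything is built from pairwise distances, nothing here refers to a central value, which is why no symmetry is assumed.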

Density estimation

I tried many functions before writing this one. One of the reasons that pushed me to do so is that in many cases the densities are not bounded, and that's just silly. You show your data, say reaction times, and your plot extends into negative values. Same with accuracy: your density goes beyond 100%. Come on, that is not possible. Here, by default, I use a histogram. Not just any histogram, but a random average shifted histogram (RASH; Bourel et al., 2014). That means it is non-parametric (no a priori shape), and it is bounded.
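The principle can be sketched in a few lines of Python. This is a simplified illustration, not the exact algorithm of Bourel et al. and not my Matlab function: many histograms with the same bin width but a randomly shifted origin are averaged, and the result is only evaluated between the smallest and largest observation. The name `rash_density`, its parameters, and the final renormalisation step are assumptions of this sketch.

```python
import math
import random


def rash_density(data, n_bins=10, n_shifts=100, n_grid=200, seed=0):
    """Average of randomly shifted histograms, evaluated on [min, max] only."""
    rng = random.Random(seed)
    lo, hi = min(data), max(data)
    if hi == lo:
        raise ValueError("data need some spread to estimate a density")
    h = (hi - lo) / n_bins  # common bin width
    grid = [lo + (hi - lo) * k / (n_grid - 1) for k in range(n_grid)]
    density = [0.0] * n_grid
    n = len(data)
    for _ in range(n_shifts):
        # Random bin origin, at most one bin width below the minimum.
        origin = lo - rng.random() * h
        counts = {}
        for x in data:
            b = math.floor((x - origin) / h)
            counts[b] = counts.get(b, 0) + 1
        # Histogram height at each grid point for this particular shift.
        for k, g in enumerate(grid):
            b = math.floor((g - origin) / h)
            density[k] += counts.get(b, 0) / (n * h)
    density = [d / n_shifts for d in density]
    # Renormalise so the curve integrates to 1 over the observed range.
    step = grid[1] - grid[0]
    area = sum((density[k] + density[k + 1]) / 2 * step for k in range(n_grid - 1))
    return grid, [d / area for d in density]


reaction_times = [0.31, 0.35, 0.36, 0.40, 0.42, 0.45, 0.47, 0.52, 0.58, 0.75]
grid, dens = rash_density(reaction_times)
```

Since the grid never leaves the observed range, reaction times cannot get a density below zero and accuracies cannot get one above 100%.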

Estimators and Intervals

By default, the function uses the Harrell-Davis estimate of the 5th decile (Harrell & Davis, 1982), which is basically the median. Finally, I plot the 95% highest density interval obtained with a Bayesian bootstrap. This has the advantage of giving you a straightforward interpretation of what you see: there is a 95% probability that the estimator lies within the interval.
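Here is a rough Python sketch of both steps, under stated assumptions. The Harrell-Davis estimate is a weighted sum of the sorted data, where weight i is the Beta((n+1)q, (n+1)(1-q)) probability of the interval [(i-1)/n, i/n]; the sketch approximates these weights with a midpoint rule rather than an exact Beta function. For the Bayesian bootstrap, it draws Dirichlet(1, ..., 1) weights and resamples the data with those probabilities, which is one common way to do it and not necessarily what my Matlab function does. The interval is the shortest one containing 95% of the bootstrap estimates. All function names are made up for this example.

```python
import math
import random


def hd_weights(n, q=0.5, steps=50):
    """Approximate Harrell-Davis weights by midpoint integration of a Beta density."""
    a, b = (n + 1) * q, (n + 1) * (1 - q)
    width = 1.0 / (n * steps)
    mids = [(k + 0.5) * width for k in range(n * steps)]
    # Unnormalised log Beta density; subtracting the peak avoids underflow.
    logk = [(a - 1) * math.log(t) + (b - 1) * math.log(1 - t) for t in mids]
    peak = max(logk)
    dens = [math.exp(v - peak) for v in logk]
    w = [sum(dens[i * steps:(i + 1) * steps]) for i in range(n)]
    total = sum(w)
    return [wi / total for wi in w]


def hd_estimate(data, weights):
    """Weighted sum of the order statistics."""
    return sum(w * x for w, x in zip(weights, sorted(data)))


def bayes_boot_hdi(data, n_boot=1000, mass=0.95, seed=0):
    """Shortest interval holding `mass` of the Bayesian bootstrap estimates."""
    rng = random.Random(seed)
    n = len(data)
    weights = hd_weights(n)  # depends on n only, so compute once
    estimates = []
    for _ in range(n_boot):
        # Exponential draws, once normalised, are Dirichlet(1, ..., 1) weights;
        # random.choices does the normalisation itself.
        g = [rng.expovariate(1.0) for _ in range(n)]
        resample = rng.choices(data, weights=g, k=n)
        estimates.append(hd_estimate(resample, weights))
    estimates.sort()
    k = math.ceil(mass * n_boot)
    start = min(range(n_boot - k + 1),
                key=lambda i: estimates[i + k - 1] - estimates[i])
    return estimates[start], estimates[start + k - 1]


scores = list(range(1, 22))
center = hd_estimate(scores, hd_weights(len(scores)))
low, high = bayes_boot_hdi(scores)
```

Unlike the sample median, the Harrell-Davis estimate uses all the order statistics, so it changes more smoothly across bootstrap samples.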

Bonus

The colors are picked from a scale I build using cubehelix. That means that, once printed in gray scale, the plot still shows differences between groups, but of intensity only (which is also more color-blind friendly).
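As an illustration of why this works, here is a small Python sketch of the published cubehelix formula (Green, 2011) with its usual default parameters. The helper names `group_colors` and `gray_level`, and the choice to skip the pure black and pure white ends of the scale, are assumptions of this example and not necessarily what my code does.

```python
import math


def cubehelix(fraction, start=0.5, rotations=-1.5, hue=1.0, gamma=1.0):
    """RGB colour (each channel in [0, 1]) at position `fraction` of the scale."""
    angle = 2 * math.pi * (start / 3 + 1 + rotations * fraction)
    f = fraction ** gamma
    amp = hue * f * (1 - f) / 2
    c, s = math.cos(angle), math.sin(angle)
    r = f + amp * (-0.14861 * c + 1.78277 * s)
    g = f + amp * (-0.29227 * c - 0.90649 * s)
    b = f + amp * (1.97294 * c)
    # Clip in case other parameter choices push a channel out of range.
    return tuple(min(1.0, max(0.0, v)) for v in (r, g, b))


def group_colors(n_groups):
    """One colour per group, skipping the black and white ends of the scale."""
    return [cubehelix((k + 1) / (n_groups + 1)) for k in range(n_groups)]


def gray_level(rgb):
    """Perceived intensity with the classic 0.30 / 0.59 / 0.11 weights."""
    r, g, b = rgb
    return 0.30 * r + 0.59 * g + 0.11 * b


colors = group_colors(4)
grays = [gray_level(c) for c in colors]  # increases steadily from group to group
```

The coefficients are chosen so that the color deviations cancel out in the intensity, which is why the gray level simply follows the position on the scale.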

References


Bourel et al. (2014). Random average shifted histograms. Computational Statistics & Data Analysis, 79, 149-164. http://dx.doi.org/10.1016/j.csda.2014.05.004

Morey et al. (2016). The fallacy of placing confidence in confidence intervals. Psychon Bull Rev, 23, 103. doi:10.3758/s13423-015-0947-8

Harrell & Davis (1982). A new distribution-free quantile estimator. Biometrika, 69(3), 635-640. doi:10.1093/biomet/69.3.635

Rousseeuw & Croux (1993). Alternatives to the median absolute deviation. Journal of the American Statistical Association, 88(424), 1273-1283.


