The usability of statistics; or, what happens when you think that (p=.05) != (p=.06)

The difference between significant and not significant is not itself significant.

That is the title of a 2006 paper by statisticians Andrew Gelman and Hal Stern. It is also the theme of a new review article in Nature Neuroscience by Sander Nieuwenhuis, Birte U. Forstmann, and Eric-Jan Wagenmakers (via Gelman’s blog). The review examined several hundred papers in behavioral, systems, and cognitive neuroscience. Of the papers that tried to compare two effects, about half made this error: they reported that one effect was significant and the other was not, instead of properly testing the difference between them (that is, testing for an interaction).

I don’t know how often the error makes it through to published papers in social and personality psychology, but I see it pretty regularly as a reviewer. I call it out when I see it; sometimes other reviewers call it out too, and sometimes they don’t.

I can also remember making this error as a grad student – and my advisor correcting me on it. But the funny thing is, it’s not something I was taught. I’m quite sure that nowhere along the way did any of my teachers say you can compare two effects by seeing if one is significant and the other isn’t. I just started doing it on my own. (And now I sometimes channel my old advisor and correct my own students on the same error, and I’m sure nobody’s teaching it to them either.)

If I wasn’t taught to make this error, where was I getting it from? When we talk about whether researchers have biases, usually we think of hot-button issues like political bias. But I think this reflects a more straightforward kind of bias — old habits of thinking that we carry with us into our professional work. To someone without scientific training, it seems like you should be able to ask “Does X cause Y, yes or no?” and expect a straightforward answer. Scientific training teaches us a couple of things. First, the question is too simple: it’s not a yes or no question; the answer is always going to come with some uncertainty; etc. Second, the logic behind the tool that most of us use – null hypothesis significance testing (NHST) – does not even approximate the form of the question. (Roughly: “In a world where X has zero effect on Y, would we see a result this far from the truth less than 5% of the time?”)

So I think what happens is that when we are taught the abstract logic of what we are doing, it doesn’t really pervade our thinking until it’s been ground into us through repetition. For a period of time – maybe in some cases forever – we carry out the mechanics of what we have been taught to do (run an ANOVA) but we map it onto our old habits of thinking (“Does X cause Y, yes or no?”). And then we elaborate and extrapolate from them in ways that are entirely sensible by their own internal logic (“One ANOVA was significant and the other wasn’t, so X causes Y more than it causes Z, right?”).
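
To make the gap between the two questions concrete, here is a minimal sketch in Python. It is not taken from either paper. The numbers are illustrative assumptions (two independent estimates of 25 and 10, each with a standard error of 10), the function names are hypothetical, and the p-values use a simple normal approximation rather than an actual ANOVA.

```python
import math


def z_and_p(estimate, se):
    """Return (z, two-sided p) for an estimate under a normal approximation."""
    z = estimate / se
    # Two-sided tail area of the standard normal: P(|Z| >= |z|).
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


def compare_effects(est_a, se_a, est_b, se_b):
    """Test the difference between two estimates directly.

    Assumes the two estimates are independent, so their variances add.
    """
    diff = est_a - est_b
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    return z_and_p(diff, se_diff)


# Illustrative (assumed) numbers, not data from any study.
z_a, p_a = z_and_p(25.0, 10.0)   # effect of X on Y, tested alone
z_b, p_b = z_and_p(10.0, 10.0)   # effect of X on Z, tested alone
z_diff, p_diff = compare_effects(25.0, 10.0, 10.0, 10.0)

print(f"effect A:   z = {z_a:.2f}, p = {p_a:.3f}")
print(f"effect B:   z = {z_b:.2f}, p = {p_b:.3f}")
print(f"difference: z = {z_diff:.2f}, p = {p_diff:.3f}")
```

With these numbers the first effect clears the .05 threshold and the second does not, yet the direct test of their difference gives a z of about 1.06, which is nowhere near significant. The two separate tests answer "is each effect different from zero?"; only the last one answers "are the two effects different from each other?"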

One of the arguments you sometimes hear against NHST is that it doesn’t reflect the way researchers think. It’s a sort of usability argument: NHST is the butterfly ballot of statistical methods. In principle, I don’t think that argument carries the day on its own (if we need to use methods and models that don’t track our intuitions, we should). But it should be part of the discussion. And importantly, the Nieuwenhuis et al. review shows us how using unintuitive methods can have real consequences.

7 thoughts on “The usability of statistics; or, what happens when you think that (p=.05) != (p=.06)”

  1. I see this occasionally in neuroimaging papers. Usually it happens when researchers hypothesize an interaction (e.g., group A will show the effect to a greater extent than group B), then show results separately for group A (activation in the hypothesized region) and group B (no activation) without ever directly comparing the two groups. Neuroimaging adds another layer to the “NHST bias” you discuss because our pretty activation graphics obscure the fact that they are just thresholded p-value maps. It always looks dramatic to show a blob on one brain and no blob on the other (e.g., for groups A and B, respectively), but that tells you nothing about the relative difference between the groups.

    I have started to add this warning (to my ever-expanding list of high-horse snarky warnings) to graduate statistics classes: A difference in significances does not imply a significant difference.

  2. bayesianbiologist, thanks for the link – very interesting. In my rough paraphrase of NHST logic, I should have said “…would we see a result AT LEAST this far from the truth…”

    Elliot, the 2-groups comparison is just one version of the problem. What about when a contrast is said to show an effect in one region but not another? I don’t know enough about fMRI analysis to be sure, but sometimes in talks or papers it sounds like what happened was that the activation was above the threshold in Region A but below threshold in Region B, rather than there being a formal test.

    Also, you’re going to have to duke it out with Gelman over who has boiled this problem down to the pithier slogan.

  3. Hi, Sanjay. It’s not just p=.05 vs. p=.06! Everybody knows that the .05 distinction is arbitrary. The point Hal and I were making was that even apparently large differences in p-values are not statistically significant. For example, if you have one study with z=2.5 (almost significant at the 1% level!) and another with z=1 (not statistically significant at all, only 1 se from zero!), then their difference has a z of about 1 (again, not statistically significant at all). So it’s not just a comparison of 0.05 vs. 0.06; even a difference between _clearly significant_ and _clearly not significant_ can be clearly not statistically significant.

  4. Thanks for the great posts (and comments)! You’ve encouraged me to emphasize these points in my research methods classes.

    A related error (more common among undergraduate students than in professional journals) is the tendency to believe that “not statistically significant” = “definitely not true.”
