Does your p-curve weigh as much as a duck?

Over at Psych Your Mind, Michael Kraus bravely reports the results of a p-curve analysis of his own publications.

p-curves were discussed by Uri Simonsohn at an SPSP symposium on false-positive findings (which I missed but got to read up about thanks to Kraus; many of the authors of the false-positive psychology paper were involved). Simonsohn has a paper forthcoming with details of the method. But the basic idea is that you should be able to tell if somebody is mining their data for significant findings by examining the distribution of p-values in their published work. A big spike of .049s and not enough <.01s could be the result of cherry-picking.

In a thoughtful but sometimes-heated discussion on the SPSP email list between Norbert Schwarz and the symposium participants, Schwarz argues — and I agree — that although p-curve analyses could be a useful tool, they will need to be interpreted cautiously. For example, Schwarz thinks that at this stage it would be inappropriate to base hiring decisions on candidates’ p-curves, something that Simonsohn apparently suggested in his talk.

A big part of the interpretive task is going to be that, as with any metric, users will have to accumulate data and build up some practical wisdom in figuring out how to interpret and apply it. Or to get a little jargony, we’ll have to do some construct validation. In particular, I think it will be crucial to remember that even though you could calculate a p-curve on a single researcher, the curve is not a property of the researcher. Rather, it will reflect the interaction of the researcher with history and context. Even setting aside measurement and sampling error, substantive factors like the incentives and practices set by publishers, granting agencies, and other powerful institutions; differing standards of different fields and subfields (e.g., in their use of NHST, in what people honestly believe and teach as acceptable practices); who the researcher was trained by and has collaborated with, etc. will affect researchers’ p-curves. Individual researchers are an important part of the picture, of course, but it would be a mistake to apply an overly simplistic model of where p-curves come from. (And of course they don’t have to be applied to individuals at all — they could be applied to literatures, to subfields, to journals, or really any way of categorizing publications).

One thing that both Schwarz and Simonsohn seem to agree on is that everybody has probably committed some or many of these errors, and we won’t make much progress unless people are willing to subject themselves to perhaps-painful soul searching. Schwarz in particular fears for a “witch hunt” atmosphere that could make people defensive and ultimately be counterproductive.

So hats off to Kraus for putting himself on the line. I’ll let you read his account and draw your own conclusions, but I think he’s impressively frank, especially for someone that early in his career. Speaking for myself, I’m waiting for Simonsohn’s paper so I can learn a little more about the method before trying it on my own vita. In the mean time I’m glad at least one of my papers has this little bit of p-curve kryptonite:

The p-values associated with the tests of the polynomial models are generally quite small, some so small as to exceed the computational limits of our data analysis software (SPSS 10.0.7, which ran out of decimal places at p < 10e–22).