Over at Psych Your Mind, Michael Kraus bravely reports the results of a p-curve analysis of his own publications.
p-curves were discussed by Uri Simonsohn at an SPSP symposium on false-positive findings (which I missed but got to read up about thanks to Kraus; many of the authors of the false-positive psychology paper were involved). Simonsohn has a paper forthcoming with details of the method. But the basic idea is that you should be able to tell if somebody is mining their data for significant findings by examining the distribution of p-values in their published work. A big spike of .049s and not enough <.01s could be the result of cherry-picking.
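To make the intuition concrete, here is a toy simulation in Python (my own sketch, not Simonsohn's actual method; the sample sizes, effect size, and bins are arbitrary). Among results that clear p < .05, studies of a real effect pile their p-values up near zero, while null results that happen to sneak through are spread roughly evenly across the bins, so a curve that leans toward .05 looks suspicious.

    # Toy sketch of the p-curve idea (not Simonsohn's method): bin the significant
    # p-values from simulated studies with a real effect vs. a null effect.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_per_group, n_studies = 30, 20000

    def significant_pvalues(effect):
        """Run two-sample t-tests on simulated studies; keep only p < .05."""
        a = rng.normal(effect, 1, size=(n_studies, n_per_group))
        b = rng.normal(0, 1, size=(n_studies, n_per_group))
        p = stats.ttest_ind(a, b, axis=1).pvalue
        return p[p < .05]

    bins = [0, .01, .02, .03, .04, .05]
    for label, effect in [("real effect (d = 0.5)", 0.5), ("null effect", 0.0)]:
        counts, _ = np.histogram(significant_pvalues(effect), bins=bins)
        # Real effect: proportions fall off sharply after the p < .01 bin;
        # null effect: roughly flat across the five bins.
        print(label, np.round(counts / counts.sum(), 2))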
In a thoughtful but sometimes-heated discussion on the SPSP email list between Norbert Schwarz and the symposium participants, Schwarz argues — and I agree — that although p-curve analyses could be a useful tool, they will need to be interpreted cautiously. For example, Schwarz thinks that at this stage it would be inappropriate to base hiring decisions on candidates’ p-curves, something that Simonsohn apparently suggested in his talk.
A big part of the interpretive task, as with any metric, will be accumulating data and building up some practical wisdom about how to interpret and apply it. Or, to get a little jargony, we'll have to do some construct validation. In particular, I think it will be crucial to remember that even though you could calculate a p-curve for a single researcher, the curve is not a property of the researcher; it reflects the interaction of the researcher with history and context. Even setting aside measurement and sampling error, researchers' p-curves will be shaped by substantive factors: the incentives and practices set by publishers, granting agencies, and other powerful institutions; the differing standards of fields and subfields (e.g., in their use of NHST and in what people honestly believe and teach as acceptable practices); who the researcher was trained by and has collaborated with; and so on. Individual researchers are an important part of the picture, of course, but it would be a mistake to apply an overly simplistic model of where p-curves come from. (And of course p-curves don't have to be applied to individuals at all; they could be applied to literatures, subfields, journals, or really any way of categorizing publications.)
One thing that both Schwarz and Simonsohn seem to agree on is that everybody has probably committed some or many of these errors, and we won't make much progress unless people are willing to subject themselves to perhaps-painful soul searching. Schwarz in particular fears a “witch hunt” atmosphere that could make people defensive and ultimately prove counterproductive.
So hats off to Kraus for putting himself on the line. I'll let you read his account and draw your own conclusions, but I think he's impressively frank, especially for someone so early in his career. Speaking for myself, I'm waiting for Simonsohn's paper so I can learn a little more about the method before trying it on my own vita. In the meantime I'm glad at least one of my papers has this little bit of p-curve kryptonite:
The p-values associated with the tests of the polynomial models are generally quite small, some so small as to exceed the computational limits of our data analysis software (SPSS 10.0.7, which ran out of decimal places at p < 10e–22).
Whew!
Another thing to consider is that some of us avoid the multiple-p-value farce altogether, because it is a misuse of p-values. I make my students use p < .05 even if the result is significant at the p = .000000000000 level, because that is the logic of NHST. You set a p-value as a decision rule, not as a glamor rating of "how significant" your finding is.
Good point, though if you were willing to spend some time on it, you could get around that by calculating exact p-values (if the papers contain enough information to do so).
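For example (a quick sketch; the test statistics and degrees of freedom below are invented for illustration), if a paper reports the statistic and its degrees of freedom, the exact p-value falls out of the reference distribution:

    # Recovering exact p-values from reported test statistics
    # (all numbers below are made up for illustration).
    from scipy import stats

    # "t(28) = 2.31, p < .05"  ->  two-tailed p of roughly .029
    print(2 * stats.t.sf(2.31, df=28))

    # "F(1, 120) = 6.8, p < .01"  ->  p of roughly .010
    print(stats.f.sf(6.8, dfn=1, dfd=120))

    # "chi-square(2) = 9.2, p < .05"  ->  p of roughly .010
    print(stats.chi2.sf(9.2, df=2))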
Nice. I suggested essentially the same idea (as I’m sure other people have as well) here; nice to see it getting some traction. As I noted in my post, I think the best way to use this type of index will be in combination with a bunch of other automated quality metrics. Bias in reported p values is probably going to be moderately correlated with a bunch of other not-so-desirable practices (e.g., using highly variable sample sizes across studies, which is indicative of peeking), so if you throw together a bunch of these kinds of metrics, you should eventually end up with a pretty reasonable and highly automated way of assessing a researcher’s methodological rigor.
Brent Roberts noted the idea of using a p-value (say p ≤ .05 = alpha) as a decision rule.
If you are going to make a decision, you should be using decision theory, not hypothesis testing with p-values. The reason is that real decisions involve not only the probabilities, but also the loss or utility of making the decision under the states of nature that your probability model describes. Just picking a particular alpha level as a decision rule isn’t adequate in the real world.
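To make that contrast concrete, here is a toy example (the probability and losses are invented): the same evidence can justify different actions depending on what each kind of mistake costs, which no single alpha cutoff can capture.

    # Toy decision-theoretic choice (illustrative numbers only): pick the action
    # with the smaller expected loss, given how probable the effect is and how
    # costly each kind of error would be.
    p_effect = 0.30          # assumed probability that the effect is real
    loss_false_alarm = 10.0  # cost of acting on an effect that isn't there
    loss_miss = 1.0          # cost of ignoring an effect that is there

    expected_loss_act = (1 - p_effect) * loss_false_alarm   # 0.70 * 10 = 7.0
    expected_loss_ignore = p_effect * loss_miss             # 0.30 * 1  = 0.3

    # Here "ignore" wins despite a 30% chance the effect is real, because a
    # false alarm is assumed to be ten times as costly as a miss.
    print("act" if expected_loss_act < expected_loss_ignore else "ignore")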
I did a quick (crude) simulation and it looks as though you'd need quite a few p-values to detect a pattern, which probably makes it unsatisfactory for early-career researchers. My simulations are probably underestimating the noise (but I could be wrong; I haven't got time to explore further just now). A rough sketch of this kind of check follows below.
http://psychologicalstatistics.blogspot.com/2012/02/simulating-p-curves-and-detecting-dodgy.html
Thom
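For what it's worth, here is a rough sketch of that kind of power check (my own crude version, not the simulation from the linked post; the peeking scheme, the .025 split, and the binomial test are all arbitrary illustrative choices). It simulates a researcher who peeks at a null effect every 10 subjects per group, stops at the first p < .05, and publishes only the significant results, then asks how often a simple test flags the resulting pile-up just under .05 for a given number of published p-values.

    # Crude check of how many significant p-values you need before a simple test
    # reliably flags optional stopping (all settings here are illustrative).
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)

    def hacked_pvalue():
        """One null-effect study with peeking every 10 subjects per group."""
        a = np.empty(0)
        b = np.empty(0)
        for _ in range(5):
            a = np.append(a, rng.normal(0, 1, 10))
            b = np.append(b, rng.normal(0, 1, 10))
            p = stats.ttest_ind(a, b).pvalue
            if p < .05:
                break
        return p

    def detection_rate(k, n_sims=200):
        """Share of simulated vitas with k significant p-values that get flagged."""
        flagged = 0
        for _ in range(n_sims):
            ps = []
            while len(ps) < k:                # publish only the significant results
                p = hacked_pvalue()
                if p < .05:
                    ps.append(p)
            high = sum(p > .025 for p in ps)  # expect about k/2 if the curve is flat
            test = stats.binomtest(high, k, 0.5, alternative="greater")
            if test.pvalue < .05:
                flagged += 1
        return flagged / n_sims

    # Takes a minute or two to run; the flag rate is printed for each k.
    for k in (10, 20, 40):
        print(k, detection_rate(k))

If anything like this holds up, it supports the point above that a single early-career vita may simply not contain enough p-values to say much.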