What if we talked about p-hacking the way we talk about experimenter effects?

Discussions about p-hacking sometimes go sideways. A hypothetical exchange might go like this:

READER: Those p-values are all hovering just below .05, I bet the authors p-hacked.

AUTHOR: I know that I did not p-hack, and I resent the accusation.

By comparison, consider how we talk about another form of potential bias: experimenter effects.

It is widely accepted that experimenters’ expectations, beliefs, or other characteristics can influence participants in behavioral experiments and medical trials. We also accept that this can happen without intent or even awareness on the part of the experimenter. Expectations about how participants receiving a treatment are supposed to differ from those receiving a placebo might show up in the experimenter’s behavior in subtle ways that could influence the participants.

We also don’t have a complete theory of experimenter effects that allows us to reliably measure every manifestation or predict with high confidence when they will and won’t occur. So instead, we consider them as an open possibility in a wide range of situations. As a result, it is also widely accepted that using procedural safeguards against experimenter effects is a best practice in most experiments where a human experimenter will interact with subjects.

Because of all these shared assumptions, discussions around experimenter effects are often much less heated. If you are presenting a study design at lab meeting, and someone says “you’ll need to keep your RAs blind to condition, here’s an idea how to do that…” that’s generally considered a helpful suggestion rather than an insinuation of planned malfeasance.

And even after a study is done, it is generally considered fair game to ask about blinding and other safeguards, and incorporate their presence or absence into an evaluation of a study. If a study lacks such safeguards, authors generally don’t say things like “I would never stoop so low as to try to influence my participants, how dare you!” Everybody, including authors, understands that experimenters don’t always know how they might be influencing subjects. And when safeguards are missing, readers typically treat it as a reason for doubt and uncertainty. We allow and even expect readers to calibrate that uncertainty judgment based on other assumptions or information, like how plausible the effect seems, how strong or weak any partial or incomplete safeguards seem, etc.

For some reason though, when it comes to potential sources of bias in data analysis, we have not (yet) reached a place where we can talk about it in a similar way. This is despite the fact that it has a lot in common with experimenter effects.

It is certainly possible for somebody to deliberately and strategically p-hack, just like it’s possible for an experimenter to wink and nudge and say “are you sure you’re not feeling better?” or whatever. But bias in data analysis does not have to happen that way. Analysts do not have to have intention or even awareness in order to do things that capitalize on chance.

Consider, first of all, that almost every data analysis involves many decisions: what data to include or exclude, whether or how to transform it, a zillion possibilities in specifying the analysis (what particular variables to look at, what analyses to run on them, whether to use one- or two-tailed tests, what covariates to include, which main, interactive, simple, or contrast effect[s] to treat as critical tests of the hypothesis, etc.), and then decisions about what to report. We psychologists of all people know that you cannot un-know something. So once the analyst has seen anything about the data – distributions, scatterplots, preliminary or interim analyses, whatever else – all the subsequent decisions will be made by a person who has that knowledge. And after that point, it is simply impossible for anybody – including the analyst – to state with any confidence how those decisions might otherwise have been made without that knowledge. Which means that we have to treat seriously the possibility that the analyst made decisions that overfit the analyses to the data.

More subtly, as Gelman and Loken discuss in their “forking paths” paper, bias is not defined by a behavior (how many analyses did you run?), but by a set of counterfactuals (how many analyses could you have run?). So even if the objective history is that one and only one analysis was run, that is not a guarantee of no bias.
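The counterfactual point can be made concrete with a small simulation. This is a sketch under assumptions that are mine, not Gelman and Loken's: two independent outcome measures, two groups of 20, no true effect on either outcome, and a z-test that treats the SD as known (1). One analyst always tests the outcome fixed in advance. The other looks at the means first and then runs exactly one test, on whichever outcome shows the larger difference.

```python
import random
from statistics import NormalDist, mean


def z_test_p(group_a, group_b):
    """Two-sided p-value for a difference in means, assuming known SD = 1
    and equal group sizes."""
    n = len(group_a)
    z = (mean(group_a) - mean(group_b)) / (2 / n) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate(n_sims=5000, n_per_group=20, seed=1):
    rng = random.Random(seed)
    fixed_hits = 0
    forking_hits = 0
    for _ in range(n_sims):
        # Two outcome measures, two groups each, and no true effect anywhere.
        outcomes = [
            ([rng.gauss(0, 1) for _ in range(n_per_group)],
             [rng.gauss(0, 1) for _ in range(n_per_group)])
            for _ in range(2)
        ]
        # Analyst 1: the outcome to test was fixed before seeing any data.
        if z_test_p(*outcomes[0]) < 0.05:
            fixed_hits += 1
        # Analyst 2: looks at the means, then runs ONE test on whichever
        # outcome shows the larger group difference.
        chosen = max(outcomes, key=lambda g: abs(mean(g[0]) - mean(g[1])))
        if z_test_p(*chosen) < 0.05:
            forking_hits += 1
    return fixed_hits / n_sims, forking_hits / n_sims


fixed_rate, forking_rate = simulate()
print(f"false-positive rate, outcome fixed in advance: {fixed_rate:.3f}")
print(f"false-positive rate, outcome chosen from data: {forking_rate:.3f}")
```

In this setup the second analyst's false-positive rate should land near 1 − 0.95² ≈ 0.10 rather than 0.05, even though the objective history shows only one test per dataset.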

What all of this means is that when it comes to bias in data analysis, we are in very much a similar situation as with experimenter effects. It is virtually impossible to measure or observe it happening in a single instance, even by the person doing the data analysis. But what we can do is define a broad set of circumstances where we have to take it seriously as a possibility.

It would be great if we could collectively shift our conversations around this issue. I think that would involve changes from both critical readers and from authors.

Start by considering procedures, not behavior or outcomes. Were safeguards in place, and if so, how effective were they? For bias in data analysis, the most common safeguard is preregistration. The mere existence of a preregistration (as indicated by a badge or an OSF link in a manuscript) tells you very little though – many of them do not actually constrain bias. Sometimes that is even by design (for example, preregistering an exploratory study is a great way to prevent editors or reviewers from pressuring you to HARK later on). A preregistration is just a transparency step; you have to actually read it to find out what it does. In order for a preregistration to prevent analytic bias, it has to do two things. First, it has to have a decision inventory – that is, it has to identify all of the decisions about what data to collect/analyze, how to analyze it, and what to report. So ask yourself: is there a section on exclusions? Transformations? Does it say what the critical test is? Etc. (This will be easier to do in domains where you are familiar with the analytic workflow for the research area. It can also be aided by consulting templates. And if authors write and post analysis code as part of a preregistration, that can make things clear too.) Second, the preregistration has to have a plan for all of those decision points. To the extent that the inventory is complete and the plans are specific and were determined separately from the data, the preregistration can be an effective safeguard against bias.
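The two-part check (is each decision in the inventory, and does each listed decision have a plan?) can be sketched as a small audit. Everything here is hypothetical: the `DecisionPoint` class, the `audit` function, the `REQUIRED` list (drawn loosely from the examples above), and the sample preregistration. A real inventory depends on the research area, and judging whether a plan is specific enough still takes a human reader.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DecisionPoint:
    name: str
    plan: Optional[str] = None  # None: the decision is listed but has no plan


# Hypothetical checklist; a real one would come from a domain template.
REQUIRED = ["exclusions", "transformations", "critical test",
            "covariates", "reporting"]


def audit(prereg, required=REQUIRED):
    """Return (decisions absent from the inventory, decisions with no plan)."""
    listed = {dp.name for dp in prereg}
    missing = [name for name in required if name not in listed]
    unplanned = [dp.name for dp in prereg if not dp.plan]
    return missing, unplanned


# Made-up preregistration for illustration only.
prereg = [
    DecisionPoint("exclusions", "drop participants who fail the attention check"),
    DecisionPoint("critical test", "two-tailed test of the condition main effect"),
    DecisionPoint("transformations"),
]

missing, unplanned = audit(prereg)
print("not in the inventory:", missing)
print("listed but no plan:", unplanned)
```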

When safeguards are missing or incomplete, everyone – authors and readers alike – should treat analytic bias as a serious possibility. If there is no preregistration or other safeguard, then bias is possible. If there is a preregistration but it was vague or incomplete, bias is also possible. In a single instance it is often impossible to know what actually happened, for the reasons I discussed above. It can be reasonable to start looking at indirect evidence, like statistical patterns (such as the distribution of p-values), whether the result is a priori implausible, etc. Inferences about these things should be made with calibrated uncertainty. p-curves are neither perfect nor useless; improbable things really do happen, though by definition rarely; etc. So usually we should not be too sure in any direction.

Inferences about authors should be rare. We should have a low bar for talking about science and a high bar for talking about scientists. This cuts both ways. Casual talk challenging authors’ competence, intentions, unreported behaviors, etc. is often both hurtful and unjustified when we are talking about single papers.* But also, authors’ positive assertions about their character, behavior, etc. rarely shed light and can have the perverse effect of reinforcing the message that they, and not just the work, are a legitimate part of the conversation. As much as possible, make all the nouns in your discussion things like “the results,” “the procedure,” etc. and not “the authors” (or for that matter “my critics”). And whether you are an author, a critic, or even an observer, you can point out when people are talking about authors and redirect the conversation to the work.

I realize this last item draws a razor-thin line and maybe sometimes it is no line at all. After all, things like what safeguards were in place, and what happened if they weren’t, are results of the researcher’s behavior. So even valid criticism implicates what the authors did or didn’t do, and it will likely be personally uncomfortable for them. But it’s a distinction that’s worth observing as much as you can when you criticize work or respond to criticisms. And I would hope we’ve learned from the ways we talk about experimenter effects that it is possible to have less heated, and frankly more substantive, discussions about bias when we do that.

Finally, it is worth pointing out that preregistration and other safeguards are still really new to psychology and many other scientific fields. We are all still learning, collectively, how to do them well. That means that we need to be able to criticize them openly, publicly, and vigorously – if we do not talk about them, we cannot get better at doing them. But it also means that some preregistration is almost always better than none, because even a flawed or incomplete one will increase transparency and make it possible to criticize work more effectively. Even as we critique preregistrations that could have been done better, we should recognize that anybody who makes that critique and improvement possible has done something of value.

* In the bigger picture, for better or worse, science pins career advancement, resources, prestige, etc. to people’s reputations. So at some point we have to be able to talk about these things. This is a difficult topic and not something I want to get into here, other than to say that discussions about who is a good scientist are probably better left to entirely separate conversations from ones where we scientifically evaluate single papers, because the evidentiary standards and consequences are so different.

Pre-publication peer review can fall short anywhere

The other day I wrote about a recent experience participating in post-publication peer review. Short version: I picked up on some errors in a paper published in PLOS ONE, which led to a correction. In my post I made the following observation:

Is this a mark against pre-publication peer review? Obviously it’s hard to say from one case, but I don’t think it speaks well of PLOS ONE that these errors got through. Especially because PLOS ONE is supposed to emphasize “a high technical standard” and reporting of “sufficient detail” (the reason I noticed the issue with the SDs was because the article did not report effect sizes).

But this doesn’t necessarily make PLOS ONE worse than traditional journals like Psychological Science or JPSP, where similar errors get through all the time and then become almost impossible to correct.

My intention was to discuss pre- and post-publication peer review generally, and I went out of my way to cite evidence that mistakes can happen anywhere. But some comments I’ve seen online have characterized this as a mark against PLOS ONE (and my “I don’t think it speaks well of PLOS ONE” phrasing probably didn’t help). So I would like to note the following:

1. After my blog post went up yesterday, somebody alerted me that the first author of the PLOS ONE paper has posted corrections to 3 other papers on her personal website. The errors are similar to what happened at PLOS ONE. She names authors and years, not full citations, but through a little deduction with her CV it appears that one of the journals is Psychological Science, one of them is the Journal of Personality and Social Psychology, and the third could be either JPSP, Personality and Social Psychology Bulletin, or the Journal of Experimental Social Psychology. So all 3 of the corrected papers were in high-impact journals with a traditional publishing model.

2. Some of the errors might look obvious now. But that is probably boosted by hindsight. It’s important to keep in mind that reviewers are busy people who are almost always working pro bono. And even at its best, the review process is always going to be a probabilistic filter. I certainly don’t check the math on every paper I read or review. I was looking at the PLOS ONE paper with a particular mindset that made me especially attentive to power and effect sizes. Other reviewers with different concerns might well have focused on different things. That doesn’t mean that we should throw up our hands, but in the big picture we need to be realistic about what we can expect of any review process (and design any improvements with that realism in mind).

3. In the end, what makes PLOS ONE different is that their online commenting system makes it possible for many eyes to be involved in a continuous review process — not just 2-3 reviewers and an editor before publication and then we’re done. That seems much smarter about the probabilistic nature of peer review. And PLOS ONE makes it possible to address potential errors quickly and transparently and in a way that is directly linked from the published article. Whereas with the other 3 papers, assuming that those corrections have been formally submitted to the respective journals, it could still be quite a while before they appear in print, and the original versions could be in wide circulation by then.


Reflections on a foray into post-publication peer review

Recently I posted a comment on a PLOS ONE article for the first time. As someone who had a decent chunk of his career before post-publication peer review came along — and has an even larger chunk of his career left with it around — it was an interesting experience.

It started when a colleague posted an article to his Facebook wall. I followed the link out of curiosity about the subject matter, but what immediately jumped out at me was that it was a 4-study sequence with pretty small samples. (See Uli Schimmack’s excellent article The ironic effect of significant results on the credibility of multiple-study articles [pdf] for why that’s noteworthy.) That got me curious about effect sizes and power, so I looked a little bit more closely and noticed some odd things. Like that different N’s were reported in the abstract and the method section. And when I calculated effect sizes from the reported means and SDs, some of them were enormous. Like Cohen’s d > 3.0 level of enormous. (If all this sounds a little hazy, it’s because my goal in this post is to talk about my experience of engaging in post-publication review — not to rehash the details. You can follow the links to the article and comments for those.)
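For readers who want to run the same kind of check on a paper that reports only means and SDs, here is a minimal sketch of Cohen's d using the pooled standard deviation. The numbers are hypothetical, chosen to show how a d of that size can arise; they are not the values from the article.

```python
from math import sqrt


def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Standardized mean difference using the pooled SD of two groups."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (mean1 - mean2) / sqrt(pooled_var)


# Hypothetical values: a 1.5-point difference with SDs of 0.5.
d = cohens_d(mean1=5.0, sd1=0.5, n1=20, mean2=3.5, sd2=0.5, n2=20)
print(round(d, 2))  # 3.0
```

Because the SD sits in the denominator, an SD that is reported too small inflates d directly: halving both SDs here would double the effect size.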

In the old days of publishing, it wouldn’t have been clear what to do next. In principle many psych journals will publish letters and comments, but in practice they’re exceedingly rare. Another alternative would have been to contact the authors and ask them to write a correction. But that relies on the authors agreeing that there’s a mistake, which authors don’t always do. And even if authors agree and write up a correction, it might be months before it appears in print.

But this article was published in PLOS ONE, which lets readers post comments on articles as a form of post-publication peer-review (PPPR). These comments aren’t just like comments on some random website or blog — they become part of the published scientific record, linked from the primary journal article. I’m all in favor of that kind of system. But it brought up a few interesting issues for how to navigate the new world of scientific publishing and commentary.

1. Professional etiquette. Here and there in my professional development I’ve caught bits and pieces of a set of gentleman’s rules about scientific discourse (and yes, I am using the gendered expression advisedly). A big one is, don’t make a fellow scientist look bad. Unless you want to go to war (and then there are rules for that too). So the old-fashioned thing to do — “the way I was raised” — would be to contact the authors quietly and petition them to make a correction themselves, so it could look like it originated with them. And if they do nothing, probably limit my comments to grumbling at the hotel bar at the next conference.

But for PPPR to work, the etiquette of “anything public is war” has to go out the window. Scientists commenting on each other’s work needs to be a routine and unremarkable part of scientific discourse. So does an understanding that even good scientists can make mistakes. And to live by the old norms is to affirm them. (Plus, the authors chose to submit to a journal that allows public comments, so caveat author.) So I elected to post a comment and then email the authors to let them know, so they would have a chance to respond quickly if they weren’t monitoring the comments. As a result, the authors posted several comments over the next couple of days correcting aspects of the article and explaining how the errors happened. And they were very responsive and cordial over email the entire time. Score one for the new etiquette.

2. A failure of pre-publication peer review? Some of the issues I raised in my comment were indisputable factual inconsistencies — like that the sample sizes were reported differently in different parts of the paper. Others were more inferential — like that a string of significant results in these 4 studies was significantly improbable, even under a reasonable expectation of an effect size consistent with the authors’ own hypothesis. A reviewer might disagree about that (maybe they think the true effect really is gigantic). Other issues, like the too-small SDs, would have been somewhere in the middle, though they turned out to be errors after all.

Is this a mark against pre-publication peer review? Obviously it’s hard to say from one case, but I don’t think it speaks well of PLOS ONE that these errors got through. Especially because PLOS ONE is supposed to emphasize “a high technical standard” and reporting of “sufficient detail” (the reason I noticed the issue with the SDs was because the article did not report effect sizes).

But this doesn’t necessarily make PLOS ONE worse than traditional journals like Psychological Science or JPSP, where similar errors get through all the time and then become almost impossible to correct. [UPDATE: Please see my followup post about pre-publication review at PLOS ONE and other journals.]

3. The inconsistency of post-publication peer review. I don’t think post-publication peer review is a cure-all. This whole episode depended on somebody (in this case, me) noticing the anomalies and being motivated to post a comment about them. If we got rid of pre-publication peer review and if the review process remained that unsystematic, it would be a recipe for a very biased system. This article’s conclusions are flattering to most scientists’ prejudices, and press coverage of the article has gotten a lot of mentions and “hell yeah”s on Twitter from pro-science folks. I don’t think it’s hard to imagine that that contributed to it getting a pass, and that if the opposite were true the article would have gotten a lot more scrutiny both pre- and post-publication. In my mind, the fix would be to make sure that all articles get a decent pre-publication review, not to scrap it altogether. Post-publication review is an important new development but should be an addition, not a replacement.

4. Where to stop? Finally, one issue I faced was how much to say in my initial comment, and how much to follow up. In particular, my original comment made a point about the low power and thus the improbability of a string of 4 studies with a rejected null. I based that on some hypotheticals and assumptions rather than formally calculating Schimmack’s incredibility index for the paper, in part because other errors in the initial draft made that impossible. The authors never responded to that particular point, but their corrections would have made it possible to calculate an IC index. So I could have come back and tried to goad them into a response. But I decided to let it go. I don’t have an axe to grind, and my initial comment is now part of the record. And one nice thing about PPPR is that readers can evaluate the arguments for themselves. (I do wish I had cited Schimmack’s paper though, because more people should know about it.)
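The hypotheticals-and-assumptions version of that argument can be sketched in a few lines. This is not Schimmack's incredibility index, just the simpler observation that if studies are independent, the chance that all of them reject the null is the product of their powers. The effect size (d = 0.5) and sample size (20 per group) are assumed for illustration and are not taken from the article; power uses a normal approximation to a two-sided, two-sample test.

```python
from statistics import NormalDist


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample test (normal approximation)."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    ncp = d * (n_per_group / 2) ** 0.5  # expected z under the assumed effect
    return norm.cdf(ncp - z_crit) + norm.cdf(-ncp - z_crit)


def prob_all_significant(powers):
    """Probability that every study is significant, assuming independence."""
    p = 1.0
    for power in powers:
        p *= power
    return p


# Assumed values: four studies, each with d = 0.5 and 20 per group.
powers = [approx_power(d=0.5, n_per_group=20) for _ in range(4)]
print(f"power per study: {powers[0]:.2f}")
print(f"P(all 4 significant): {prob_all_significant(powers):.3f}")
```

Under these assumptions each study has roughly a one-in-three chance of reaching significance, so four out of four would be a rare outcome even if the effect is real.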