The Reproducibility Project: Psychology (RPP) was published in Science last week. There has been some excellent coverage and discussion since then. If you haven’t heard about it,* Ed Yong’s Atlantic coverage will catch you up. And one of my favorite commentaries so far is on Michael Frank’s blog, with several very smart and sensible ways the field can proceed next.
Rather than offering a broad commentary, in this post I’d like to discuss one possible interpretation of the results of the RPP, which is “hidden moderators.” Hidden moderators are unmeasured differences between original and replication experiments that would result in differences in the true, underlying effects and therefore in the observed results of replications. Things like differences in subject populations and experimental settings. Moderator interpretations were the subject of a lengthy discussion on the ISCON Facebook page recently, and are the focus of an op-ed by Lisa Feldman Barrett.
In the post below, I evaluate the hidden-moderator interpretation. The tl;dr version is this: Context moderators are probably common in the world at large and across independently-conceived experiments. But an explicit design goal of direct replication is to eliminate them, and there’s good reason to believe they are rare in replications.
1. Context moderators are probably not common in direct replications
Many social and personality psychologists believe that lots of important effects vary by context out in the world at large. I am one of those people — subject and setting moderators are an important part of what I study in my own work. William McGuire discussed the idea quite eloquently, and it can be captured in an almost koan-like quote from Niels Bohr: “The opposite of one profound truth may very well be another profound truth.” It is very often the case that support for a broad hypothesis will vary and even be reversed over traditional “moderators” like subjects and settings, as well as across different methods and even different theoretical interpretations.
Consider as a thought experiment** what would happen if you told 30 psychologists to go test the deceptively simple-seeming hypothesis “Happiness causes smiling” and turned them loose. You would end up with 30 different experiments that would differ in all kinds of ways that we can be sure would matter: subjects (with different cultural norms of expressiveness), settings (e.g., subjects run alone vs. in groups), manipulations and measures of the IV (film clips, IAPS pictures, frontal asymmetry, PANAS?) and DV (FACS, EMG, subjective ratings?), and even construct definitions (state or trait happiness? eudaimonic or hedonic? Duchenne or social smiles?). You could learn a lot by studying all the differences between the experiments and their results.
But that stands in stark contrast to how direct replications are carried out, including the RPP. Replicators aren’t just turned loose with a broad hypothesis. In direct replication, the goal is to test the hypothesis “If I do the same experiment, I will get the same result.” Sometimes a moderator-ish hypothesis is built in (“this study was originally done with college students, will I get the same effect on Mturk?”). But such differences from the original are planned in. The explicit goal of replication design is for any other differences to be controlled out. Well-designed replication research makes a concerted effort to faithfully repeat the original experiments in every way that documentation, expertise, and common sense say should matter (and often in consultation with original authors too). The point is to squeeze out any room for substantive differences.
Does it work? In a word, yes. We now have data telling us that the squeezing can be very effective. In Many Labs 1 and Many Labs 3 (which I reviewed here), different labs followed standardized replication protocols for a series of experiments. In principle, different experimenters, different lab settings, and different subject populations could have led to differences between lab sites. But in analyses of heterogeneity across sites, that was not the result. In ML1, some of the very large and obvious effects (like anchoring) varied a bit in just how large they were (from “kinda big” to “holy shit”). Across both projects, more modest effects were quite consistent. Nowhere was there evidence that interesting effects wink in and out of detectability for substantive reasons linked to sample or setting.
We will continue to learn more as our field gets more experience with direct replication. But right now, a reasonable conclusion from the good, systematic evidence we have available is this: When some researchers write down a good protocol and other researchers follow it, the results tend to be consistent. In the bigger picture this is a good result for social psychology: it is empirical evidence that good scientific control is within our reach, neither beyond our experimental skills nor intractable for the phenomena we study.***
But it also means that when replicators try to clamp down potential moderators, it is reasonable to think that they usually do a good job. Remember, the Many Labs labs weren’t just replicating the original experiments (from which their results sometimes differed – more on that in a moment). They were very successfully and consistently replicating each other. There could be individual exceptions here and there, but on the whole our field’s experience with direct replication so far tells us that it should be unusual for unanticipated moderators to escape replicators’ diligent efforts at standardization and control.
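To make "analyses of heterogeneity across sites" a little more concrete, here is a minimal sketch of one standard way to check it: Cochran's Q and the I² statistic computed over per-site effect estimates. This is not the Many Labs analysis code. The site counts, sample sizes, and effect sizes below are made-up assumptions, and the helper names (`simulate_site`, `cochran_q`) are hypothetical.

```python
import random
import statistics


def simulate_site(true_effect, n_per_group, rng):
    """Simulate one two-group site; return (mean difference, its sampling variance).

    Outcomes have SD 1 in both groups, so the mean difference is in SD units.
    """
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    treated = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
    diff = statistics.fmean(treated) - statistics.fmean(control)
    var = (statistics.variance(treated) + statistics.variance(control)) / n_per_group
    return diff, var


def cochran_q(effects, variances):
    """Return (inverse-variance pooled effect, Cochran's Q, I^2)."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, q, i2


if __name__ == "__main__":
    rng = random.Random(2015)  # arbitrary seed, for repeatability only

    # Assumed scenario A: 20 sites following one protocol, same true effect everywhere.
    same = [simulate_site(0.4, 100, rng) for _ in range(20)]
    # Assumed scenario B: a hidden moderator switches the effect off at half the sites.
    moderated = [simulate_site(0.8 if i % 2 else 0.0, 100, rng) for i in range(20)]

    for label, sites in (("no moderator", same), ("hidden moderator", moderated)):
        effects, variances = zip(*sites)
        pooled, q, i2 = cochran_q(effects, variances)
        print(f"{label}: pooled={pooled:.2f}  Q={q:.1f}  I2={i2:.2f}")
```

When sites share one true effect, Q should hover around its degrees of freedom (the number of sites minus one) and I² should sit near zero. A moderator that varies by site pushes both up.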
2. A comparison of a published original and a replication is not a good way to detect moderators
Moderation means there is a substantive difference between two or more (true, underlying) effects as a function of the moderator variable. When you design an experiment to test a moderation hypothesis, you have to set things up so you can make a valid comparison. Your observations should ideally be unbiased, or failing that, the biases should be the same at different levels of the moderator so that they cancel out in the comparison.
With the RPP (and most replication efforts), we are trying to interpret observed differences between published original results and replications. The moderator interpretation rests on the assumption that observed differences between experiments are caused by substantive differences between them (subjects, settings, etc.). An alternative explanation is that there are different biases. And that is almost certainly the case. The original experiments are generally noisier because of modest power, and that noise is then passed through a biased filter (publication bias for sure — these studies were all published at selective journals — and perhaps selective reporting in some cases too). By contrast, the replications are mostly higher powered, the analysis plans were pre-registered, and the replicators committed from the outset to publish their findings no matter what the results.
That means that a comparison of published original studies and replication studies in the RPP is a poor way to detect moderators, because you are comparing a noisy and biased observation to one that is much less so.**** And such a comparison would be a poor way to detect moderators even if you were quite confident that moderators were out there somewhere waiting to be found.
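Here is a rough simulation sketch of that point. Every "original" and every "replication" below is drawn from exactly the same true effect, so there is no moderator anywhere. The only differences are sample size and whether a significance filter is applied before "publication." The true effect (0.2), the sample sizes (20 vs. 100 per group), and the filter (significant in the positive direction at z > 1.96) are all assumptions chosen for illustration, not estimates from the RPP, and the function names are hypothetical.

```python
import math
import random
import statistics


def run_study(true_d, n_per_group, rng):
    """Simulate one two-group study; return (observed Cohen's d, approximate z)."""
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    treated = [rng.gauss(true_d, 1.0) for _ in range(n_per_group)]
    diff = statistics.fmean(treated) - statistics.fmean(control)
    pooled_sd = math.sqrt(
        (statistics.variance(treated) + statistics.variance(control)) / 2
    )
    d = diff / pooled_sd
    # Rough large-sample standard error of d for equal groups: sqrt(2 / n).
    z = d / math.sqrt(2.0 / n_per_group)
    return d, z


def published_originals(true_d, n_per_group, n_studies, rng, z_crit=1.96):
    """Keep running small studies, but 'publish' only the significant positive ones."""
    kept = []
    while len(kept) < n_studies:
        d, z = run_study(true_d, n_per_group, rng)
        if z > z_crit:
            kept.append(d)
    return kept


def replications(true_d, n_per_group, n_studies, rng):
    """Larger studies, all reported regardless of outcome."""
    return [run_study(true_d, n_per_group, rng)[0] for _ in range(n_studies)]


if __name__ == "__main__":
    rng = random.Random(2015)  # arbitrary seed, for repeatability only
    originals = published_originals(0.2, 20, 200, rng)
    reps = replications(0.2, 100, 200, rng)
    print(f"mean published original d: {statistics.fmean(originals):.2f}")
    print(f"mean replication d:        {statistics.fmean(reps):.2f}")
```

Under these assumptions the published originals come out looking much larger than the replications, even though the underlying effect never changed. A gap like that is what the biased-filter account predicts on its own, which is why the gap by itself can't tell you a moderator is present.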
3. Moderator explanations of the Reproducibility Project are (now) post hoc
The Reproducibility Project has been conducted with an unprecedented degree of openness. It was started 4 years ago. Both the coordinating plan and the protocols of individual studies were pre-registered. The list of selected studies was open. Original authors were contacted and invited to consult.
What that means is that anyone could have looked at an original study and a replication protocol, applied their expert judgment, and made a genuinely a priori prediction of how the replication results would differ from the original. Such a prediction could have been put out in the open at any time, or it could have been pre-registered and embargoed so as not to influence the replication researchers.
Until last Friday, that is.
Now the results of the RPP are widely known. And although it is tempting to now look back selectively at “failed” replications and generate substantively interesting reasons, such explanations have to be understood for what they are: untested post hoc speculation. (And if someone now says they expected a failure all along, they’re possibly HARKing, hypothesizing after the results are known, too.)
Now, don’t get me wrong — untested post hoc speculation is often what inspires new experiments. So if someone thinks they see an important difference between an original result and a replication and gets an idea for a new study to test it out, more power to them. Get thee to the lab.
But as an interpretation of the data we have in front of us now, we should be clear-eyed in appraising such explanations, especially as an across-the-board factor for the RPP. From a bargain-basement Bayesian perspective, context moderators in well-controlled replications have a low prior probability (#1 above), and comparisons of original and replication studies have limited evidential value because of unequal noise and bias (#2). Put those things together and the clear message is that we should be cautious about concluding that there are hidden moderators lurking everywhere in the RPP. Here and there, there might be compelling, idiosyncratic reasons to think there could be substantive differences to motivate future research. But on the whole, as an explanation for the overall pattern of findings, hidden moderators are not a strong contender.
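The bargain-basement Bayesian arithmetic can be sketched in a few lines using the odds form of Bayes' rule. The numbers here are placeholders I made up to show the shape of the argument (a low prior combined with weak evidence), not estimates of anything. The function name is hypothetical too.

```python
def posterior_probability(prior, likelihood_ratio):
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


if __name__ == "__main__":
    # Assumed inputs, for illustration only:
    prior = 0.10             # low prior that a hidden moderator is at work (#1)
    likelihood_ratio = 1.5   # weak evidence from an original-vs-replication gap (#2)
    print(f"posterior: {posterior_probability(prior, likelihood_ratio):.3f}")
```

A likelihood ratio close to 1 barely moves a low prior. With inputs like these the posterior stays low, which is the sense in which hidden moderators are not a strong contender.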
Instead, we need to face up to the very well-understood and very real differences that we know about between published original studies and replications. The noxious synergy between low power and selective publication is certainly a big part of the story. Fortunately, psychology has been making changes since 2008, when the RPP original studies were published. And positive changes keep happening.
Would it be nice to think that everything was fine all along? Of course. And moderator explanations are appealing because they suggest that everything is fine, we’re just discovering limits and boundary conditions like we’ve always been doing.***** But it would be counterproductive if that undermined our will to continue to make needed improvements to our methods and practices. Personally, I don’t think everything in our past is junk, even post-RPP – just that we can do better. Constant self-critique and improvement are an inherent part of science. We have diagnosed the problem and we have a good handle on the solution. All of that makes me feel pretty good.
* Seriously though?
** A thought meta-experiment? A gedankengedankenexperiment?
*** I think if you asked most social psychologists, divorced from the present conversation about replicability and hidden moderators, they would already have endorsed this view. But it is nice to have empirical meta-scientific evidence to support it. And to show the “psychology isn’t a science” ninnies.
**** This would be true even if you believed that the replicators were negatively biased and consciously or unconsciously sandbagged their efforts. You’d think the bias was in the other direction, but it would still be unequal and therefore make comparisons of original vs. replication a poor empirical test of moderation. (You’d also be a pretty cynical person, but anyway.)
***** For what it’s worth, I’m not so sure that the hidden moderator interpretation would actually be all that reassuring under the cold light of a rational analysis. This is not the usual assumption that moderators are ubiquitous out in the world. We are talking about moderators that pop up despite concerted efforts to prevent them. Suppose that ubiquitous occult moderators were the sole or primary explanation for the RPP results — so many effects changing when we change so little, going from one WEIRD sample to another, with maybe 5-ish years of potential for secular drift, and using a standardized protocol. That would suggest that we have a poor understanding of what is going on in our labs. It would also suggest that it is extraordinarily hard to study main effects or try to draw even tentative generalizable conclusions about them. And given how hard it is to detect interactions, that would mean that our power problem would be even worse than people think it is now.