Accountable replications at Royal Society Open Science: A model for scientific publishing

Kintsugi pottery. Source: Wikimedia Commons.

Six years ago I floated an idea for scientific journals that I nicknamed the Pottery Barn Rule. The name is a reference to an apocryphal retail store policy captured in the phrase, “you break it, you bought it.” The idea is that if you pick something up in the store, you are responsible for what happens to it in your hands.* The gist of the idea in that blog post was as follows: “Once a journal has published a study, it becomes responsible for publishing direct replications of that study. Publication is subject to editorial review of technical merit but is not dependent on outcome.”

The Pottery Barn framing was somewhat lighthearted, but a more serious inspiration for it (though I don’t think I emphasized this much at the time) was newspaper correction policies. When news media take a strong stance on vetting reports of errors and correcting the ones they find, they are more credible in the long run. The good ones understand that taking a short-term hit when they mess up is part of that larger process.**

The core principle driving the Pottery Barn Rule is accountability. When a peer-reviewed journal publishes an empirical claim, its experts have judged the claim to be sound enough to tell the world about. A journal that adopts the Pottery Barn Rule is signaling that it stands behind that judgment. If a finding was important enough to publish the first time, then it is important enough to put under further scrutiny, and the journal takes responsibility to tell the world about those efforts too.

In the years since the blog post, a few journals have picked up the theme in their policies or practices. For example, the Journal of Research in Personality encourages replications and offers an abbreviated review process for studies it has published within the last 5 years. Psychological Science offers a preregistered direct replications submission track for replications of work it has published. And Scott Lilienfeld has announced that Clinical Psychological Science will follow the Pottery Barn Rule. In all three instances, these journals have led the field in signaling that they take responsibility for publishing replications.

And now, thanks to hard work by Chris Chambers,*** I was excited to read this morning that the journal Royal Society Open Science has announced a new replication policy that is the fullest implementation yet. Other journals should view the RSOS policy as a model for adoption. In the new policy, RSOS is committing to publishing any technically sound replication of any study it has published, regardless of the result, and providing a clear workflow for how it will handle such studies.

What makes the RSOS policy stand out? Accountability means tying your hands – you do not get to dodge it when it will sting or make you look bad. Under the RSOS policy, editors will still judge the technical faithfulness of replication studies. But they cannot avoid publishing replications on the basis of perceived importance or other subjective factors. Rather, whatever determination the journal originally made about those subjective questions at the time of the initial publication is applied to the replication. Making this a firm commitment, and having it spelled out in a transparent written policy, means that the scientific community knows where the journal stands and can easily see if the journal is sticking to its commitment. Making it a written policy (not just a practice) also means it is more likely to survive past the tenure of the current editors.

Such a policy should be a win both for the journal and for science. For RSOS – and for authors that publish there – articles will now have the additional credibility that comes from a journal saying it will stand by the decision. For science, this will contribute to a more complete and less biased scientific record.

Other journals should now follow suit. Just as readers trust a news source more if it is transparent about corrections — and less if it leaves it to other media to fix its mistakes — readers should have more trust in journals that view replications of things they’ve published as their responsibility, rather than leaving them to other (often implicitly “lesser”) journals. Adopting the RSOS policy, or one like it, will be a way for journals to raise the credibility of the work they publish while making scientific publishing more rigorous and transparent.


* In reality, the actual Pottery Barn store will quietly write off the breakage as a loss and let you pretend it never happened. This is probably not a good model to emulate for science.

** One difference is that because newspapers report concrete facts, they work from a presumption that they got those things right, and they only publish corrections for errors. In science, by contrast, uncertainty looms much larger in our epistemology. We draw conclusions from the accumulation of statistical evidence, so the results of all verification attempts have value regardless of outcome. But the common theme across both domains is being accountable for the things you have reported.

*** You may remember Chris from such films as registered reports and stop telling the world you can cure cancer because of seven mice.

The replication price-drop in social psychology

Why is the replication crisis centered on social psychology? In a recent post, Andrew Gelman offered a list of possible reasons. Although I don’t agree with every one of his answers (I don’t think data-sharing is common in social psych for example), it is an interesting list of ideas and an interesting question.

I want to riff on one of those answers, because it is something I’ve been musing about for a while. Gelman suggests that in social psychology, experiments are comparatively easy and cheap to replicate. Let’s stipulate that this is true of at least some parts of social psych. (Not necessarily all of them – I’ll come back to that.) What would easy and cheap replications do for a field? I’d suggest they have two, somewhat opposing effects.

On the one hand, running replications is the most straightforward way to obtain evidence about whether an effect is replicable.1 So the easier it is to run a replication, the easier it will be to discover if a result is a fluke. Broaden that out, and if a field has lots of replicability problems and replications are generally easy to run, it should be easier to diagnose the field.

But on the other hand, in a field or area where it is easy to run replications, that ease should facilitate a scientific ecosystem where unreplicable work gets weeded out. So over time, you might expect a field to settle into an equilibrium where, by routinely running those easy and cheap replications, it keeps unreplicable work at a comfortably low rate.2 No underlying replication problem, therefore no replication crisis.

The idea that I have been musing on for a while is that “replications are easy and cheap” is a relatively new development in social psychology, and I think that may have some interesting implications. I tweeted about it a while back but I thought I’d flesh it out.

Consider that until around a decade ago, almost all social psychology studies were run in person. You might be able to do a self-report questionnaire study in mass testing sessions, but a lot of experimental protocols could only be run a few subjects at a time. For example, any protocol that involved interaction or conditional logic (i.e., couldn’t just be printed on paper for subjects to read) required live RAs to run the show. A fair amount of measurement was analog and required data entry. And computerized presentation or assessment was rate-limited by the number of computers and cubicles a lab owned. All of this meant that even a lot of relatively simpler experiments required a nontrivial investment of labor and maybe money. And a lot of those costs were per-subject costs, so they did not scale up well.

All of this changed only fairly recently, with the explosion of internet experimentation. In the early days of the dotcom revolution you had to code websites yourself,3 but eventually companies like Qualtrics sprang up with pretty sophisticated and usable software for running interactive experiments. That meant that subjects could complete many kinds of experiments at home without any RA labor to run the study. And even for in-lab studies, a lot of data entry – which had been a labor-intensive part of running even a simple self-report study – was cut out. (Or if you were already using experiment software, you went from having to buy a site license for every subject-running computer to being able to run it on any device with a browser, even a tablet or phone.) And Mechanical Turk meant that you could recruit cheap subjects online in large numbers and they would be available virtually instantly.

Altogether, what this means is that for some kinds of experiments in some areas of psychology, replications have undergone a relatively recent and sizeable price drop. Some kinds of protocols pretty quickly went from something that might need a semester and a team of RAs to something you could set up and run in an afternoon.4 And since you weren’t booking up your finite lab space or spending a limited subject-pool allocation, the opportunity costs got lower too.

Notably, growth of all of the technologies that facilitated the price-drop accelerated right around the same time as the replication crisis was taking off. Bem, Stapel, and false-positive psychology were all in 2011. That’s the same year that Buhrmester et al. published their guide to running experiments on Mechanical Turk, and just a year later Qualtrics got a big venture capital infusion and started expanding rapidly.

So my conjecture is that the sudden price drop helped shift social psychology out of a replications-are-rare equilibrium and moved it toward a new one. In pretty short order, experiments that previously would have been costly to replicate (in time, labor, money, or opportunity) got a lot cheaper. This meant that there was a gap between the two effects of cheap replications I described earlier: All of a sudden it was easy to detect flukes, but there was a buildup of unreplicable effects in the literature from the old equilibrium. This might explain why a lot of replications in the early twenty-teens were social priming5 studies and similar paradigms that lend themselves to online experimentation pretty well.

To be sure, I don’t think this could by any means be a complete theory. It’s more of a facilitating change along with other factors. Even if replications are easy and cheap, researchers still need to be motivated to go and run them. Social psychology had a pretty strong impetus to do that in 2011, with Bem, Stapel, and False-positive psychology all breaking in short order. And as researchers in social psychology started finding cause for concern in those newly-cheap studies, they were motivated to widen their scope to replicating other studies that had been designed, analyzed, and reported in similar ways but that hadn’t had so much of a price-drop.

To date that broadening-out from the easy and cheap studies hasn’t spread nearly as much to other subfields like clinical psychology or developmental psychology. Perhaps there is a bit of an ingroup/outgroup dynamic – it is easier to say “that’s their problem over there” than to recognize commonalities. And those fields don’t have a bunch of cheap-but-influential studies of their own to get them started internally.6

An optimistic spin on all this is that social psychology could be on its way to a new equilibrium where running replications becomes more of a normal thing. But there will need to be an accompanying culture shift where researchers get used to seeing replications as part of mainstream scientific work.

Another implication is that the price-drop and resulting shift in equilibrium has created a kind of natural experiment where the weeding-out process has lagged behind the field’s ability to run cheap replications. A boom in metascience research has taken advantage of this lag to generate insights into what does7 and doesn’t8 make published findings less likely to be replicated. Rather than saying “oh that’s those people over there,” fields and areas where experiments are difficult and expensive could and should be saying, wow, we could have a problem and not even know it – but we can learn some lessons from seeing how “those people over there” discovered they had a problem and what they learned about it.


  1. Hi, my name is Captain Obvious. 
  2. Conversely, it is possible that a field where replications are hard and expensive might reach an equilibrium where unreplicable findings could sit around uncorrected. 
  3. RIP the relevance of my perl skills. 
  4. Or let’s say a week + an afternoon if you factor in getting your IRB exemption. 
  5. Yeah, I said social priming without scare quotes. Come at me. 
  6. Though admirably, some researchers in those fields are now trying anyway, costs be damned. 
  7. Selective reporting of underpowered results
  8. Hidden moderators

Bold changes at Psychological Science

Style manuals sound like they ought to be boring things, full of arcane points about commas and whatnot. But Wikipedia’s style manual has an interesting admonition: Be bold. The idea is that if you see something that could be improved, you should dive in and start making it better. Don’t wait until you are ready to be comprehensive, don’t fret about getting every detail perfect. That’s the path to paralysis. Wikipedia is an ongoing work in progress, your changes won’t be the last word but you can make things better.

In a new editorial at Psychological Science, interim editor Stephen Lindsay is clearly following the be bold philosophy. He lays out a clear and progressive set of principles for evaluating research. Beware the “troubling trio” of low power, surprising results, and just-barely-significant results. Look for signs of p-hacking. Care about power and precision. Don’t confuse nonsignificant for null.

To people who have been paying attention to the science reform discussion of the last few years (and its longstanding precursors), none of this is new. What is new is that an editor of a prominent journal has clearly been reading and absorbing the last few years’ wave of careful and thoughtful scholarship on research methods and meta-science. And he is boldly acting on it.

I mean, yes, there are some things I am not 100% in love with in that editorial. Personally, I’d like to see more value placed on good exploratory research.* I’d like to see him discuss whether Psychological Science will be less results-oriented, since that is a major contributor to publication bias.** And I’m sure other people have their objections too.***

But… Improving science will forever be a work in progress. Lindsay has laid out a set of principles. In the short term, they will be interpreted and implemented by humans with intelligence and judgment. In the longer term, someone will eventually look at what is and is not working and will make more changes.

Are Lindsay’s changes as good as they could possibly be? The answers are (1) “duh” because obviously no and (2) “duh” because it’s the wrong question. Instead let’s ask, are these changes better than things have been? I’m not going to give that one a “duh,” but I’ll stand behind a considered “yes.”

———-

* Part of this is because in psychology we don’t have nearly as good a foundation of implicit knowledge and accumulated wisdom for differentiating good from bad exploratory research as we do for hypothesis-testing. So exploratory research gets a bad name because somebody hacks around in a tiny dataset and calls it “exploratory research,” and nobody has the language or concepts to say why they’re doing it wrong. I hope we can fix that. For starters, we could steal more ideas from the machine learning and genomics people, though we will need to adapt them for the particular features of our scientific problems. But that’s a blog post for another day.

** There are some nice comments about this already on the ISCON facebook page. Dan Simons brought up the exploratory issue; Victoria Savalei the issue about results-focus. My reactions on these issues are in part bouncing off of theirs.

*** When I got to the part about using confidence intervals to support the null, I immediately had a vision of steam coming out of some of the Twitter Bayesians’ ears.

Moderator interpretations of the Reproducibility Project

The Reproducibility Project: Psychology (RPP) was published in Science last week. There has been some excellent coverage and discussion since then. If you haven’t heard about it,* Ed Yong’s Atlantic coverage will catch you up. And one of my favorite commentaries so far is on Michael Frank’s blog, with several very smart and sensible ways the field can proceed next.

Rather than offering a broad commentary, in this post I’d like to discuss one possible interpretation of the results of the RPP, which is “hidden moderators.” Hidden moderators are unmeasured differences between original and replication experiments that would result in differences in the true, underlying effects and therefore in the observed results of replications. Things like differences in subject populations and experimental settings. Moderator interpretations were the subject of a lengthy discussion on the ISCON Facebook page recently, and are the focus of an op-ed by Lisa Feldman Barrett.

In the post below, I evaluate the hidden-moderator interpretation. The tl;dr version is this: Context moderators are probably common in the world at large and across independently-conceived experiments. But an explicit design goal of direct replication is to eliminate them, and there’s good reason to believe they are rare in replications.

1. Context moderators are probably not common in direct replications

Many social and personality psychologists believe that lots of important effects vary by context out in the world at large. I am one of those people — subject and setting moderators are an important part of what I study in my own work. William McGuire discussed the idea quite eloquently, and it can be captured in an almost koan-like quote from Niels Bohr: “The opposite of one profound truth may very well be another profound truth.” It is very often the case that support for a broad hypothesis will vary and even be reversed over traditional “moderators” like subjects and settings, as well as across different methods and even different theoretical interpretations.

Consider as a thought experiment** what would happen if you told 30 psychologists to go test the deceptively simple-seeming hypothesis “Happiness causes smiling” and turned them loose. You would end up with 30 different experiments that would differ in all kinds of ways that we can be sure would matter: subjects (with different cultural norms of expressiveness), settings (e.g., subjects run alone vs. in groups), manipulations and measures of the IV (film clips, IAPS pictures, frontal asymmetry, PANAS?) and DV (FACS, EMG, subjective ratings?), and even construct definitions (state or trait happiness? eudaimonic or hedonic? Duchenne or social smiles?). You could learn a lot by studying all the differences between the experiments and their results.

But that stands in stark contrast to how direct replications are carried out, including the RPP. Replicators aren’t just turned loose with a broad hypothesis. In direct replication, the goal is to test the hypothesis “If I do the same experiment, I will get the same result.” Sometimes a moderator-ish hypothesis is built in (“this study was originally done with college students, will I get the same effect on Mturk?”). But such differences from the original are planned in. The explicit goal of replication design is for any other differences to be controlled out. Well-designed replication research makes a concerted effort to faithfully repeat the original experiments in every way that documentation, expertise, and common sense say should matter (and often in consultation with original authors too). The point is to squeeze out any room for substantive differences.

Does it work? In a word, yes. We now have data telling us that the squeezing can be very effective. In Many Labs 1 and Many Labs 3 (which I reviewed here), different labs followed standardized replication protocols for a series of experiments. In principle, different experimenters, different lab settings, and different subject populations could have led to differences between lab sites. But in analyses of heterogeneity across sites, that was not the result. In ML1, some of the very large and obvious effects (like anchoring) varied a bit in just how large they were (from “kinda big” to “holy shit”). Across both projects, more modest effects were quite consistent. Nowhere was there evidence that interesting effects wink in and out of detectability for substantive reasons linked to sample or setting.
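(If you want a concrete sense of what an analysis of heterogeneity across sites involves, here is a minimal sketch in Python. The per-lab effect sizes and variances are made up for illustration, not taken from the Many Labs data, but Cochran’s Q and I² are the standard quantities this kind of analysis reports.)

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical per-lab effect estimates (Cohen's d) and sampling variances.
# These numbers are invented for illustration; they are not Many Labs results.
d = np.array([0.21, 0.35, 0.28, 0.18, 0.40, 0.25])
v = np.array([0.02, 0.03, 0.02, 0.04, 0.03, 0.02])

# Inverse-variance weights and the fixed-effect pooled estimate
w = 1 / v
d_pooled = np.sum(w * d) / np.sum(w)

# Cochran's Q: weighted squared deviations of each lab from the pooled estimate
Q = np.sum(w * (d - d_pooled) ** 2)
df = len(d) - 1
p = chi2.sf(Q, df)

# I^2: rough share of total variability attributable to between-lab differences
I2 = max(0.0, (Q - df) / Q) * 100

print(f"pooled d = {d_pooled:.2f}, Q = {Q:.2f} (df = {df}, p = {p:.2f}), I^2 = {I2:.0f}%")
```

With numbers like these, Q is nowhere near significant and I² is essentially zero, which is the statistical signature of the “no evidence of heterogeneity across sites” pattern described above.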

We will continue to learn more as our field gets more experience with direct replication. But right now, a reasonable conclusion from the good, systematic evidence we have available is this: When some researchers write down a good protocol and other researchers follow it, the results tend to be consistent. In the bigger picture this is a good result for social psychology: it is empirical evidence that good scientific control is within our reach, neither beyond our experimental skills nor intractable for the phenomena we study.***

But it also means that when replicators try to clamp down potential moderators, it is reasonable to think that they usually do a good job. Remember, the Many Labs labs weren’t just replicating the original experiments (from which their results sometimes differed – more on that in a moment). They were very successfully and consistently replicating each other. There could be individual exceptions here and there, but on the whole our field’s experience with direct replication so far tells us that it should be unusual for unanticipated moderators to escape replicators’ diligent efforts at standardization and control.

2. A comparison of a published original and a replication is not a good way to detect moderators

Moderation means there is a substantive difference between 2 or more (true, underlying) effects as a function of the moderator variable. When you design an experiment to test a moderation hypothesis, you have to set things up so you can make a valid comparison. Your observations should ideally be unbiased, or failing that, the biases should be the same at different levels of the moderator so that they cancel out in the comparison.

With the RPP (and most replication efforts), we are trying to interpret observed differences between published original results and replications. The moderator interpretation rests on the assumption that observed differences between experiments are caused by substantive differences between them (subjects, settings, etc.). An alternative explanation is that there are different biases. And that is almost certainly the case. The original experiments are generally noisier because of modest power, and that noise is then passed through a biased filter (publication bias for sure — these studies were all published at selective journals — and perhaps selective reporting in some cases too). By contrast, the replications are mostly higher powered, the analysis plans were pre-registered, and the replicators committed from the outset to publish their findings no matter what the results.

That means that a comparison of published original studies and replication studies in the RPP is a poor way to detect moderators, because you are comparing a noisy and biased observation to one that is much less so.**** And such a comparison would be a poor way to detect moderators even if you were quite confident that moderators were out there somewhere waiting to be found.
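(To make the unequal-bias point concrete, here is a small simulation sketch in Python. The true effect, sample sizes, and crude significance filter are illustrative assumptions of mine, not anything estimated from the RPP, but they show how selecting low-powered originals on significance inflates them relative to unfiltered, higher-powered replications of the exact same effect.)

```python
import numpy as np

rng = np.random.default_rng(1)
true_d, n_orig, n_rep, sims = 0.2, 20, 100, 20_000  # illustrative values only

def observed_d(n):
    """Cohen's d from a simulated two-group study with n subjects per group."""
    treat = rng.normal(true_d, 1.0, (sims, n))
    ctrl = rng.normal(0.0, 1.0, (sims, n))
    pooled_sd = np.sqrt((treat.var(axis=1, ddof=1) + ctrl.var(axis=1, ddof=1)) / 2)
    return (treat.mean(axis=1) - ctrl.mean(axis=1)) / pooled_sd

d_orig = observed_d(n_orig)  # low-powered "original" studies
d_rep = observed_d(n_rep)    # higher-powered replications, published regardless of result

# Crude publication filter: only originals that cross p < .05 (normal approximation) get published
published = d_orig / np.sqrt(2 / n_orig) > 1.96

print(f"true d: {true_d}")
print(f"mean published original d: {d_orig[published].mean():.2f}")
print(f"mean replication d:        {d_rep.mean():.2f}")
```

Run it and the published originals average several times the true effect while the replications sit right on top of it, even though there is no moderator anywhere in the data-generating process.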

3. Moderator explanations of the Reproducibility Project are (now) post hoc

The Reproducibility Project has been conducted with an unprecedented degree of openness. It was started 4 years ago. Both the coordinating plan and the protocols of individual studies were pre-registered. The list of selected studies was open. Original authors were contacted and invited to consult.

What that means is that anyone could have looked at an original study and a replication protocol, applied their expert judgment, and made a genuinely a priori prediction of how the replication results would differ from the original. Such a prediction could have been put out in the open at any time, or it could have been pre-registered and embargoed so as not to influence the replication researchers.

Until last Friday, that is.

Now the results of the RPP are widely known. And although it is tempting to now look back selectively at “failed” replications and generate substantively interesting reasons, such explanations have to be understood for what they are: untested post hoc speculation. (And if someone now says they expected a failure all along, they’re possibly HARKing too.)

Now, don’t get me wrong — untested post hoc speculation is often what inspires new experiments. So if someone thinks they see an important difference between an original result and a replication and gets an idea for a new study to test it out, more power to them. Get thee to the lab.

But as an interpretation of the data we have in front of us now, we should be clear-eyed in appraising such explanations, especially as an across-the-board factor for the RPP. From a bargain-basement Bayesian perspective, context moderators in well-controlled replications have a low prior probability (#1 above), and comparisons of original and replication studies have limited evidential value because of unequal noise and bias (#2). Put those things together and the clear message is that we should be cautious about concluding that there are hidden moderators lurking everywhere in the RPP. Here and there, there might be compelling, idiosyncratic reasons to think there could be substantive differences to motivate future research. But on the whole, as an explanation for the overall pattern of findings, hidden moderators are not a strong contender.

Instead, we need to face up to the very well-understood and very real differences that we know about between published original studies and replications. The noxious synergy between low power and selective publication is certainly a big part of the story. Fortunately, psychology has already started to make changes since 2008 when the RPP original studies were published. And positive changes keep happening.

Would it be nice to think that everything was fine all along? Of course. And moderator explanations are appealing because they suggest that everything is fine, we’re just discovering limits and boundary conditions like we’ve always been doing.***** But it would be counterproductive if that undermined our will to continue to make needed improvements to our methods and practices. Personally, I don’t think everything in our past is junk, even post-RPP – just that we can do better. Constant self-critique and improvement are an inherent part of science. We have diagnosed the problem and we have a good handle on the solution. All of that makes me feel pretty good.

———

* Seriously though?

** A thought meta-experiment? A gedankengedankenexperiment?

*** I think if you asked most social psychologists, divorced from the present conversation about replicability and hidden moderators, they would already have endorsed this view. But it is nice to have empirical meta-scientific evidence to support it. And to show the “psychology isn’t a science” ninnies.

**** This would be true even if you believed that the replicators were negatively biased and consciously or unconsciously sandbagged their efforts. You’d think the bias was in the other direction, but it would still be unequal and therefore make comparisons of original vs. replication a poor empirical test of moderation. (You’d also be a pretty cynical person, but anyway.)

***** For what it’s worth, I’m not so sure that the hidden moderator interpretation would actually be all that reassuring under the cold light of a rational analysis. This is not the usual assumption that moderators are ubiquitous out in the world. We are talking about moderators that pop up despite concerted efforts to prevent them. Suppose that ubiquitous occult moderators were the sole or primary explanation for the RPP results — so many effects changing when we change so little, going from one WEIRD sample to another, with maybe 5-ish years of potential for secular drift, and using a standardized protocol. That would suggest that we have a poor understanding of what is going on in our labs. It would also suggest that it is extraordinarily hard to study main effects or try to draw even tentative generalizable conclusions about them. And given how hard it is to detect interactions, that would mean that our power problem would be even worse than people think it is now.

Replicability in personality psychology, and the symbiosis between cumulative science and reproducible science

There is apparently an idea going around that personality psychologists are sitting on the sidelines having a moment of schadenfreude during the whole social psychology Replicability Crisis thing.

Not true.

The Association for Research in Personality conference just wrapped up in St. Louis. It was a great conference, with lots of terrific research. (Highlight: watching three of my students give kickass presentations.) And the ongoing scientific discussion about openness and reproducibility had a definite, noticeable effect on the program.

The most obvious influence was the (packed) opening session on reproducibility. First, Rich Lucas talked about the effects of JRP’s recent policy of requiring authors to explicitly talk about power and sample size decisions. The policy has had a noticeable impact on sample sizes of published papers, without major side effects like tilting toward college samples or cheap self-report measures.

Second, Simine Vazire talked about the particular challenges of addressing openness and replicability in personality psychology. A lot of the discussion in psychology has been driven by experimental psychologists, and Simine talked about what the general issues that cut across all of science look like when applied in particular to personality psychology. One cool recommendation she had (not just for personality psychologists) was to imagine that you had to include a “Most Damning Result” section in your paper, where you had to report the one result that looked worst for your hypothesis. How would that change your thinking?*

Third, David Condon talked about particular issues for early-career researchers, though really it was for anyone who wants to keep learning – he had a charming story of how he was inspired by seeing one of his big-name intellectual heroes give a major award address at a conference, then show up the next morning for an “Introduction to R” workshop. He talked a lot about tools and technology that we can use to help us do more open, reproducible science.

And finally, Dan Mroczek talked about research he has been doing with a large consortium to try to do reproducible research with existing longitudinal datasets. They have been using an integrated data analysis framework as a way of combining longitudinal datasets to test novel questions, and to look at issues like generalizability and reproducibility across existing data. Dan’s talk was a particularly good example of why we need broad participation in the replicability conversation. We all care about the same broad issues, but the particular solutions that experimental social psychologists identify aren’t going to work for everybody.

In addition to its obvious presence in the plenary session, reproducibility and openness seemed to suffuse the conference. As Rick Robins pointed out to me, there seemed to be a lot more people presenting null findings in a more open, frank way. And talk of which findings were replicated and which weren’t, of people tempering conclusions from initial data, and so on was common and totally well received, like it was a normal part of science. Imagine that.

One thing that stuck out at me in particular was the relationship between reproducible science and cumulative science. Usually I think of the first helping the second; you need robust, reproducible findings as a foundation before you can either dig deeper into process or expand out in various ways. But in many ways, the conference reminded me that the reverse is true as well: cumulative science helps reproducibility.

When people are working on the same or related problems, using the same or related constructs and measures, and so on, then it becomes much easier to do robust, reproducible science. In many ways structural models like the Big Five have helped personality psychology with that. For example, the integrated data analysis that Dan talked about requires you to have measures of the same constructs in every dataset. The Big Five provide a common coordinate system to map different trait measures onto, even if they weren’t originally conceptualized that way. Psychology needs more models like that in other domains – common coordinate systems of constructs and measures that help make sense of how different research programs fit together.

And Simine talked about (and has blogged about) the idea that we should collect fewer but better datasets, with more power and better but more labor-intensive methods. If we are open with our data, we can do something really well, and then combine or look across datasets better to take advantage of what other people do really well – but only if we are all working on the same things so that there is enough useful commonality across all those open datasets.

That means we need to move away from a career model of science where every researcher is supposed to have an effect, construct, or theory that is their own little domain that they’re king or queen of. Personality psychology used to be that way, but the Big Five has been a major counter to that, at least in the domain of traits. That kind of convergence isn’t problem-free — the model needs to evolve (Big Six, anyone?), which means that people need the freedom to work outside of it; and it can’t try to subsume things that are outside of its zone of relevance. Some people certainly won’t love it – there’s a certain satisfaction to being the World’s Leading Expert on X, even if X is some construct or process that only you and maybe your former students are studying. But that’s where other fields have gone, even going as far as expanding beyond the single-investigator lab model: Big Science is the norm in many parts of physics, genomics, and other fields. With the kinds of problems we are trying to solve in psychology – not just our reproducibility problems, but our substantive scientific ones — that may increasingly be a model for us as well.


———-

* Actually, I don’t think she was only imagining. Simine is the incoming editor at SPPS.** Give it a try, I bet she’ll desk-accept the first paper that does it, just on principle.

** And the main reason I now have footnotes in most of my blog posts.

An open review of Many Labs 3: Much to learn

A pre-publication manuscript for the Many Labs 3 project has been released. The project, with 64 authors and supported by the Center for Open Science, ran replications of 10 previously-published effects on diverse topics. The research was conducted in 20 different university subject pools plus an Mturk comparison sample, with very high statistical power (over 3,000 subjects total). The project was pre-registered, and wherever possible the replicators worked with original authors or other experts to make the replications faithful to the originals. Big kudos to the project coordinating team and all the researchers who participated in this important work, as well as all the original authors who worked with the replicators.

A major goal was to examine whether time of semester moderates effect sizes, testing the common intuition among researchers that subjects are “worse” (less attentive) at the end of the term. But really, there is much more to it than that:

Not much replicated. The abstract says that 7 of 10 effects did not replicate. But dig a little deeper and the picture is more complicated. For starters, only 9 of those 10 effects were direct (or if you prefer, “close”) replications. The other was labeled a conceptual replication and deserves separate handling. More on that below; for now, let’s focus on the 9 direct replications.

Of the 3 effects that were tagged as “successful” replications, one — the Stroop effect — is so well-established that it is probably best thought of as a validity check on the project. If the researchers could not obtain a Stroop effect, I would have been more inclined to believe that they screwed something up than to believe that the Stroop isn’t real. But they got the Stroop effect, quite strongly. Test passed. (Some other side findings from manipulation checks or covariates could also be taken as validation tests. Strong arguments were judged as more persuasive; objectively warmer rooms were judged as warmer; etc.)

One major departure I have from the authors is how they defined a successful replication — as a significant effect in the same direction as the original significant effect. Tallying significance tests is not a good metric because, as the saying goes, the difference between significant and not significant is not necessarily significant. And perhaps less obviously, two significant effects — even in the same direction — can be very different.

I have argued before that the core question of replication is really this: If I do the same experiment, will I observe the same result? So the appropriate test is whether the replication effect equals the original effect, plus or minus error. It was not always possible to obtain an effect size from the original published report. But in many instances where it was, the replication effects were much smaller than the original ones. A particularly striking case of this was the replication of Tversky and Kahneman’s original, classic demonstration of the availability heuristic. The original effect was d = 0.82, 95% CI = [0.47, 1.17]. The replication was d = 0.09, 95% CI = [0.02, 0.16]. So the ML3 authors counted it as a “successful” replication, but the replication study obtained a substantially different effect than the original study.
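(Here is what that test looks like as a quick calculation, sketched in Python. The only inputs are the effect sizes and confidence intervals quoted in the paragraph above; the standard errors are backed out of the CIs under a normal approximation.)

```python
from math import sqrt
from scipy.stats import norm

def se_from_ci(lower, upper, level=0.95):
    """Back out a standard error from a reported confidence interval (normal approximation)."""
    z = norm.ppf(1 - (1 - level) / 2)
    return (upper - lower) / (2 * z)

# Availability heuristic: original vs. ML3 replication, as quoted above
d_orig, se_orig = 0.82, se_from_ci(0.47, 1.17)
d_rep, se_rep = 0.09, se_from_ci(0.02, 0.16)

# z test for the difference between two independent effect size estimates
diff = d_orig - d_rep
se_diff = sqrt(se_orig**2 + se_rep**2)
z = diff / se_diff
p = 2 * norm.sf(abs(z))

print(f"difference in d = {diff:.2f}, z = {z:.2f}, p = {p:.5f}")
```

With these inputs the difference is large and clearly significant, which is just a formal restatement of the fact that the two confidence intervals do not come close to overlapping. The same function could be pointed at the metaphoric structuring comparison discussed below.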

Does it matter if the results are different but in the same direction? I say it does. Effect sizes matter, for a variety of reasons. Lurking in the background is an important set of meta-science questions about the quality of evidence in published reports. Remember, out of a set of 9 direct replications, 6 could not produce the original effect at all. Against that backdrop, what does it mean when a 7th one’s conclusion was right but the data to support it were not reproducible? You start to wonder if it’s a win for Tversky and Kahneman’s intellects rather than for the scientific process. I don’t want to count on my colleagues’ good guesswork (not all of us are as insightful as Tversky and Kahneman). I want to be persuaded by good data.

For what it’s worth, the other “successful” replication (metaphoric structuring, from a 2000 study by Lera Boroditsky) had a smaller point estimate than the original study, but the original had low power and thus a huge confidence interval (d = 0.07 to 1.20!). Without running the numbers, my guess is that a formal test of the difference between the original and replication effects would not be significant for that reason.

So what’s the box score for the direct replications? On the plus side, 1 gimme and 1 newish effect. In the middle, 1 replication that agreed with the original result in direction but sharply differed in magnitude. And then 6 blanks.

None of the interactions replicated. Of 3 hypothesized interactions, none of them replicated. None, zip, nada. All three were interactions between a manipulated variable and an individual difference. Interactions like that require a lot of statistical power to detect, and the replication effort had much more power than the original. We now know that under conditions of publication bias, published underpowered studies are especially likely to be biased (i.e., false positives). So add this to the growing pile of evidence that researchers and journals need to pay more attention to power.
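(As a back-of-the-envelope illustration of why that matters, with effect sizes chosen purely for illustration rather than taken from the replicated studies: required sample size scales with one over the squared effect, so an interaction contrast half the size of a main effect needs roughly four times the subjects. A quick sketch in Python:)

```python
from math import ceil
from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sample test of standardized effect d (normal approximation)."""
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    return ceil(2 * (z_a + z_b) ** 2 / d ** 2)

# Illustrative numbers only: a d = 0.5 main effect vs. an interaction contrast of d = 0.25
print(n_per_group(0.50))  # about 63 per group
print(n_per_group(0.25))  # about 252 per group: roughly four times as many
```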

A particularly interesting case is the replication of a moral credentialing study. In the original study, the authors had hypothesized a main effect of a credentialing manipulation. They didn’t find the main effect, but they did find (unexpectedly) an interaction with gender. By contrast, Many Labs 3 found the originally hypothesized main effect but not the gender interaction. So it is quite possible (if you think ML3 is the more reliable dataset) that the original study obtained both a false negative and a false positive — both potentially costly consequences of running with too little power.

Emblematic problems with a “conceptual replication.” An oddball in the mix was the one “conceptual replication.” The original study looked at associations between two self-report instruments, the NEO PI-R and Cloninger’s TCI, in a sample of psychiatric patients admitted to a hospital emergency unit. The replicators picked out one correlation to focus on — the one between NEO Conscientiousness and TCI Persistence. But they subbed in a 2-item scale from Gosling et al.’s TIPI for the 48-item NEO Conscientiousness scale, and instead of using the TCI they used time spent on an unsolvable anagram task as a measure of “persistence.” And they looked at all this in university subject pools and an Mturk sample.

The NEO-to-TIPI exchange is somewhat defensible, as the TIPI has been validated against the NEO PI-R.* But the shorter measure should be expected to underestimate any effects.

Swapping in the unsolvable anagrams for the TCI Persistence scale is a bit more mystifying. Sure, people sometimes call the anagram task a measure of “persistence” but (a) it’s a one-off behavioral measure not optimized to measuring stable individual differences, and (b) since there is no literature cited (or that I know of) connecting the anagram task to the TCI, there is a serious risk of falling prey to the jingle fallacy. In other words, just because the thing measured by the TCI and the thing measured by the anagrams are called “persistence” by different researchers, that does not mean they are the same construct. (For example, goals are hierarchical. Planful “persistence” at higher-order goals might be very different than wheel-spinning “persistence” on narrow tasks. Indeed, one of the 2 cited studies on the anagrams task used disengagement from the anagrams as an index of adaptive behavior.)

These problems, I think, are emblematic of problems with how “conceptual replication” is often treated. First, direct and conceptual replication are not the same thing and should not be treated as such. Direct replication is about the reproducibility of scientific observations; conceptual replication is about putting a theory to the test in more than one way. Both are a necessary part of science but they are not substitutable. Second, in this case the conceptual replication is problematic on its own terms. So many things changed from the original to the “replication” that it is difficult to interpret the difference in results. Is the issue that the TIPI was swapped in for the NEO? The anagrams for the TCI? Is it an issue of going from self-report to tested behavior? From an aggregate measure of personality to a one-shot measure? Psychiatric emergency patients to undergrads? Conceptual replication is much more useful when it is incremental and systematic.

The same interpretive difficulty would be present if the ML3 result had appeared to replicate the original, but I suspect we would be much less likely to notice it. We might just say hooray, it replicated — when it is entirely possible that we would actually be looking at 2 completely unrelated effects that just happened to have gotten the same labels.

Another blow against the “hidden moderator” defense. So far I haven’t touched on the stated purpose of this project, which was to look at time-of-semester as a moderator of effects. The findings are easy to summarize: it turns out there was no there there. The authors also tested for heterogeneity of effects due to order (since subjects got the tasks for the 10 replications in random order) or across samples and labs. There was no there there either. Not even in the 2 tasks that were administered in person rather than by computers and thus were harder to standardize.

The authors put the interpretation so well that I’ll quote them at length here [emphasis added]:

A common explanation for the challenges of replicating results across samples and settings is that there are many seen and unseen moderators that qualify the detectability of effects (Cesario, 2014). As such, when differences are observed across study administrations, it is easy to default to the assumption that it must be due to features differing between the samples and settings. Besides time of semester, we tested whether the site of data collection, and the order of administration during the study session moderated the effects. None of these had a substantial impact on any of the investigated effects. This observation is consistent with the first “Many Labs” study (Klein et al., 2014) and is the focus of the second (Klein et al., 2015). The present study provides further evidence against sample and setting differences being a default explanation for variation in replicability. That is not to deny that such variation occurs, just that direct evidence for a given effect is needed to demonstrate that it is a viable explanation.

When effects do not replicate, “maybe there were moderators” needs to be treated as what it is – a post hoc speculation for future testing. Moderator explanations are worth consideration on a case by case basis, and hunting for them could lead to genuine discoveries. But as evidence keeps accruing that sample and setting moderators are not commonplace, we should be more and more skeptical of those kinds of explanations offered post hoc. High-powered pre-registered replications are very credible. And when the replication results disagree with the original study, our interpretations should account for what we know about the statistical properties of the studies and the way we came to see the evidence, and give less weight to untested speculation.

—–

* An earlier version incorrectly stated that the TIPI had only been validated indirectly against the NEO. Thanks to Sam Gosling for pointing out the error.

Popper on direct replication, tacit knowledge, and theory construction

I’ve quoted some of this before, but it was buried in a long post and it’s worth quoting at greater length and on its own. It succinctly lays out Popper’s views on several issues relevant to present-day discussions of replication in science. Specifically, he makes clear that (1) scientists should replicate their own experiments; (2) scientists should be able to instruct other experts how to reproduce their experiments and get the same results; and (3) establishing the reproducibility of experiments (“direct replication” in the parlance of our times) is a necessary precursor for all the other things you do to construct and test theories.

Kant was perhaps the first to realize that the objectivity of scientific statements is closely connected with the construction of theories — with the use of hypotheses and universal statements. Only when certain events recur in accordance with rules or regularities, as is the case with repeatable experiments, can our observations be tested — in principle — by anyone. We do not take even our own observations quite seriously, or accept them as scientific observations, until we have repeated and tested them. Only by such repetitions can we convince ourselves that we are not dealing with a mere isolated ‘coincidence’, but with events which, on account of their regularity and reproducibility, are in principle inter-subjectively testable.

Every experimental physicist knows those surprising and inexplicable apparent ‘effects’ which in his laboratory can perhaps even be reproduced for some time, but which finally disappear without trace. Of course, no physicist would say in such a case that he had made a scientific discovery (though he might try to rearrange his experiments so as to make the effect reproducible). Indeed the scientifically significant physical effect may be defined as that which can be regularly reproduced by anyone who carries out the appropriate experiment in the way prescribed. No serious physicist would offer for publication, as a scientific discovery, any such ‘occult effect,’ as I propose to call it — one for whose reproduction he could give no instructions. The ‘discovery’ would be only too soon rejected as chimerical, simply because attempts to test it would lead to negative results. (It follows that any controversy over the question whether events which are in principle unrepeatable and unique ever do occur cannot be decided by science: it would be a metaphysical controversy.)

– Karl Popper (1959/2002), The Logic of Scientific Discovery, pp. 23-24.