False-positive psychology five years later

Joe Simmons, Leif Nelson, and Uri Simonsohn have written a 5-years-later[1] retrospective on their “false-positive psychology” paper. It is for an upcoming issue of Perspectives on Psychological Science dedicated to the most-cited articles from APS publications. A preprint is now available.

It’s a short and snappy read with some surprises and gems. For example, footnote 2 notes that the Journal of Consumer Research declined to adopt their disclosure recommendations because they might “dull … some of the joy scholars may find in their craft.” No, really.

For the youngsters out there, they do a good job of capturing in a sentence a common view of what we now call p-hacking: “Everyone knew it was wrong, but they thought it was wrong the way it’s wrong to jaywalk. We decided to write ‘False-Positive Psychology’ when simulations revealed it was wrong the way it’s wrong to rob a bank.”[2]

The retrospective also contains a review of how the paper has been cited in 3 top psychology journals. About half of the citations are from researchers following the original paper’s recommendations, but typically only a subset of them. The most common citation practice is to justify having barely more than 20 subjects per cell, which they now describe as a “comically low threshold” and about which they now take a more nuanced view.

But to me, the most noteworthy passage was this one because it speaks to institutional pushback on the most straightforward of their recommendations:

Our paper has had some impact. Many psychologists have read it, and it is required reading in at least a few methods courses. And a few journals – most notably, Psychological Science and Social Psychological and Personality Science – have implemented disclosure requirements of the sort that we proposed (Eich, 2014; Vazire, 2015). At the same time, it is worth pointing out that none of the top American Psychological Association journals have implemented disclosure requirements, and that some powerful psychologists (and journal interests) remain hostile to costless, common sense proposals to improve the integrity of our field.

Certainly there are some small refinements you could make to some of the original paper’s disclosure recommendations. For example, Psychological Science requires you to disclose all variables “that were analyzed for this article’s target research question,” not all variables period. Which is probably an okay accommodation for big multivariate studies with lots of measures.[3]

But it is odd to be broadly opposed to disclosing information in scientific publications that other scientists would consider relevant to evaluating the conclusions. And yet I have heard these kinds of objections raised many times. What is lost by saying that researchers have to report all the experimental conditions they ran, or whether data points were excluded and why? Yet here we are in 2017 and you can still get around doing that.

 


1. Well, five-ish. The paper came out in late 2011.

2. Though I did not have the sense at the time that everyone knew about everything. Rather, knowledge varied: a given person might think that fiddling with covariates was like jaywalking (technically wrong but mostly harmless), that undisclosed dropping of experimental conditions was a serious violation, but be completely oblivious to the perils of optional stopping. And a different person might have had a different constellation of views on the same 3 issues.

3. A counterpoint is that if you make your materials open, then without clogging up the article proper, you allow interested readers to go and see for themselves.

The University of Oregon should re-name Deady and Dunn Halls

Background: My university is currently weighing whether to re-name two buildings on campus. It was prompted to do so by the UO Black Student Task Force, which demanded last year that the buildings be renamed. Deady Hall is an academic building named after Matthew Deady, a politician and later judge who advocated successfully at the founding of our state to exclude Black citizens from residing here. Dunn Hall is a residence hall named after a former professor who was also an Exalted Cyclops of the KKK. Our president, Michael Schill, created a process for making the decision and appointed a commission of historians to study the two figures and their legacy. The commission has issued its report, and now Schill has invited comment from the university community. Below is the comment that I submitted:

Both buildings should be re-named. I find myself very much in agreement with the reasoning that Matthew Dennis stated in his August 21 Register-Guard editorial. Building names are not neutral markers; they are a way to put a name in a place of prominence and honor the namesake. The idea that we need to keep their names on the buildings as “reminders” simply does not stand up to scrutiny. We have a history department to teach us about history. We name buildings for other reasons.

Dunn Hall seems like the obvious case, being named after a former Exalted Cyclops of the KKK. I fear that in doing one obviously right thing, the university will feel morally licensed to “split the difference” and keep Deady. That would be a mistake.

At the founding of our state, Deady actively promoted the exclusion of Black citizens from Oregon. The defense of Deady seems to rest primarily on his later stance toward Chinese immigrants and descendants. The implication is that that somehow erases his lifelong anti-Black racism, as if racism and racial atonement against different groups are fungible. This strikes me as a distinctly White perspective, viewing non-White groups as interchangeable. Will we look in the face of a community that has been harmed, proclaim “But he was decent to those other people!” and expect them to accept that as amends?

The fact is that Deady never made amends for his anti-Black racism, he never disavowed it, and his actions are still reverberating today in a state whose population includes about 2% African Americans. My department (Psychology) has never had an African American tenure-track professor, and I have been told that the same is true across the entire Division of Natural Sciences. While the reasons are surely complex, I will note that when my department has tried to recruit African American faculty, the underrepresentation of African Americans in our community has come up as a challenge in enticing people to move here and make Oregon their home. That underrepresentation is the direct legacy of Matthew Deady’s political activism. This is not a man that the university should be honoring.

Postscript: You can read more about the history of racism in Oregon in Matt Novak’s well-researched article at Gizmodo.

Everything is fucked: The syllabus

PSY 607: Everything is Fucked
Prof. Sanjay Srivastava
Class meetings: Mondays 9:00 – 10:50 in 257 Straub
Office hours: Held on Twitter at your convenience (@hardsci)

In a much-discussed article at Slate, social psychologist Michael Inzlicht told a reporter, “Meta-analyses are fucked” (Engber, 2016). What does it mean, in science, for something to be fucked? Fucked needs to mean more than that something is complicated or must be undertaken with thought and care, as that would be trivially true of everything in science. In this class we will go a step further and say that something is fucked if it presents hard conceptual challenges to which implementable, real-world solutions for working scientists are either not available or routinely ignored in practice.

The format of this seminar is as follows: Each week we will read and discuss 1-2 papers that raise the question of whether something is fucked. Our focus will be on things that may be fucked in research methods, scientific practice, and philosophy of science. The potential fuckedness of specific theories, research topics, etc. will not be the focus of this class per se, but rather will be used to illustrate these important topics. To that end, each week a different student will be assigned to find a paper that illustrates the fuckedness (or lack thereof) of that week’s topic, and give a 15-minute presentation about whether it is indeed fucked.

Grading:

20% Attendance and participation
30% In-class presentation
50% Final exam

Week 1: Psychology is fucked

Meehl, P. E. (1990). Why summaries of research on psychological theories are often uninterpretable. Psychological Reports, 66, 195-244.

Week 2: Significance testing is fucked

Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304-1312.

Rouder, J. N., Morey, R. D., Verhagen, J., Province, J. M., & Wagenmakers, E. J. (2016). Is there a free lunch in inference? Topics in Cognitive Science, 8, 520-547.

Week 3: Causal inference from experiments is fucked

Chapter 3 from: Bollen, K. A. (1989). Structural equations with latent variables. New York: Wiley.

Week 4: Mediation is fucked

Bullock, J. G., Green, D. P., & Ha, S. E. (2010). Yes, but what’s the mechanism? (Don’t expect an easy answer). Journal of Personality and Social Psychology, 98, 550-558.

Week 5: Covariates are fucked

Culpepper, S. A., & Aguinis, H. (2011). Using analysis of covariance (ANCOVA) with fallible covariates. Psychological Methods, 16, 166-178.

Westfall, J., & Yarkoni, T. (2016). Statistically controlling for confounding constructs is harder than you think. PLoS ONE, 11, e0152719.

Week 6: Replicability is fucked

Pashler, H., & Harris, C. R. (2012). Is the replicability crisis overblown? Three arguments examined. Perspectives on Psychological Science, 7, 531-536.

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.

Week 7: Interlude: Everything is fine, calm the fuck down

Gilbert, D. T., King, G., Pettigrew, S., & Wilson, T. D. (2016). Comment on “Estimating the reproducibility of psychological science.” Science, 351, 1037a.

Maxwell, S. E., Lau, M. Y., & Howard, G. S. (2015). Is psychology suffering from a replication crisis? What does “failure to replicate” really mean? American Psychologist, 70, 487-498.

Week 8: Scientific publishing is fucked

Fanelli, D. (2011). Negative results are disappearing from most disciplines and countries. Scientometrics, 90, 891-904.

Ioannidis, J. P. (2005). Why most published research findings are false. PLoS Med, 2, e124.

Week 9: Meta-analysis is fucked

Inzlicht, M., Gervais, W., & Berkman, E. (2015). Bias-Correction Techniques Alone Cannot Determine Whether Ego Depletion is Different from Zero: Commentary on Carter, Kofler, Forster, & McCullough, 2015. Available at SSRN: http://ssrn.com/abstract=2659409 or http://dx.doi.org/10.2139/ssrn.2659409

Van Elk, M., Matzke, D., Gronau, Q. F., Guan, M., Vandekerckhove, J., & Wagenmakers, E. J. (2015). Meta-analyses are no substitute for registered replications: A skeptical perspective on religious priming. Frontiers in Psychology, 6.

Week 10: The scientific profession is fucked

Bakker, M., van Dijk, A., & Wicherts, J. M. (2012). The rules of the game called psychological science. Perspectives on Psychological Science, 7, 543-554.

Nosek, B. A., Spies, J. R., & Motyl, M. (2012). Scientific utopia II. Restructuring incentives and practices to promote truth over publishability. Perspectives on Psychological Science, 7, 615-631.

Finals week

Wear black and bring a #2 pencil.

Don’t change your family-friendly tenure extension policy just yet

If you are an academic and on social media, then over the last weekend your feed was probably full of mentions of an article by economist Justin Wolfers in the New York Times titled “A Family-Friendly Policy That’s Friendliest to Male Professors.”

It describes a study by three economists of the effects of parental tenure extension policies, which give an extra year on the tenure clock when people become new parents. The conclusion is that tenure extension policies do make it easier for men to get tenure, but they unexpectedly make it harder for women. The finding has a counterintuitive flavor – a policy couched in gender-neutral terms and designed to help families actually widens a gender gap.

Except there are a bunch of odd things that start to stick out when you look more closely at the details, and especially at the original study.

Let’s start with the numbers in the NYT writeup:

The policies led to a 19 percentage-point rise in the probability that a male economist would earn tenure at his first job. In contrast, women’s chances of gaining tenure fell by 22 percentage points. Before the arrival of tenure extension, a little less than 30 percent of both women and men at these institutions gained tenure at their first jobs.

Two things caught my attention when I read this. First, that a 30% tenure rate sounded awfully low to me (this is at the top-50 PhD-granting economics departments). Second, that tenure extension policies took the field from parity (“30 percent of both women and men”) to a 6-to-1 lopsided rate favoring men (the effects are percentage points, so it goes to a 49% tenure rate for men vs. 8% for women). That would be a humongous effect size.
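The arithmetic behind that 6-to-1 figure can be checked in a few lines. This is a minimal sketch that treats the NYT’s “a little less than 30 percent” as exactly 30% for both groups, which is my rounding and not a number from the study; the variable names are mine too.

```python
# Baseline tenure-at-first-job rate, rounded to 30% for illustration
# (the NYT says "a little less than 30 percent").
baseline = 30

# Reported policy effects, in percentage points.
men_change = 19
women_change = -22

men_rate = baseline + men_change      # implied rate for men under the policy
women_rate = baseline + women_change  # implied rate for women under the policy
ratio = men_rate / women_rate

print(f"men: {men_rate}%, women: {women_rate}%, ratio: {ratio:.1f} to 1")
```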

Regarding the 30% tenure rate, it turns out the key words are “at their first jobs.” This analysis compared people who got tenure at their first job to everybody else — which means that leaving for a better outside offer is treated the same in this analysis as being denied tenure. So the tenure-at-first-job variable is not a clear indicator of whether the policy is helping or hurting a career. What if you look at the effect of the policy on getting tenure anywhere? The authors did that, and they summarize the analysis succinctly: “We find no evidence that gender-neutral tenure clock stopping policies reduce the fraction of women who ultimately get tenure somewhere” (p. 4). That seems pretty important.

What about that swing from gender-neutral to a 6-to-1 disparity in the at-first-job analysis? Consider this: “There are relatively few women hired at each university during the sample period. On average, only four female assistant professors were hired at each university between 1985 and 2004, compared to 17 male assistant professors” (p. 17). That was a stop-right-there moment for me: if you are an economics department worried about gender equality, maybe instead of rethinking tenure extensions you should be looking at your damn hiring practices. But as far as the present study goes, there are n = 62 women at institutions that never adopted gender-neutral tenure extension policies, and n = 129 at institutions that did. (It’s even worse than that because only a fraction of them are relevant for estimating the policy effect; more on that below). With a small sample there is going to be a lot of uncertainty in the estimates under the best of conditions. And it’s not the best of conditions: Within the comparison group (the departments that never adopted a tenure extension policy), there are big, differential changes in men’s and women’s tenure rates over the study period (1985 to 2004): Over time, men’s tenure rate drops by about 25%, and women’s tenure rate doubles from 12% to 25%. Any observed effect of a department adopting a tenure-extension policy is going to have to be estimated in comparison to that noisy, moving target.
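To get a feel for how much uncertainty those sample sizes carry, here is a rough sketch. It assumes a plain comparison of two independent proportions at a 30% baseline rate, which is far simpler (and more forgiving) than the model in the paper. The function name is hypothetical.

```python
import math

def diff_margin(p, n1, n2, z=1.96):
    """Approximate 95% margin of error for a difference between two
    independent proportions that share the same underlying rate p."""
    se = math.sqrt(p * (1 - p) / n1 + p * (1 - p) / n2)
    return z * se

# n = 62 women at never-adopting institutions, n = 129 at adopting ones.
# The 30% rate is an assumed baseline, used only for illustration.
margin = diff_margin(0.30, 62, 129)
print(f"margin of error: +/- {margin * 100:.1f} percentage points")
```

Even in this best case the margin comes out at roughly 14 percentage points, and that is before accounting for the moving comparison group or the fact that only some of these women are informative about the policy.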

Critically, the statistical comparison of tenure-extension policy is averaged over every assistant professor in the sample, regardless of whether the individual professor used the policy. (The authors don’t have data on who took a tenure extension, or even on who had kids.) But causation is only defined for those individuals in whom we could observe a potential outcome at either level of the treatment. In plain English: “How does this policy affect people” only makes sense for people who could have been affected by the policy — meaning people who had kids as assistant professors, and therefore could have taken an extension if one were available. So if the policy did have an effect in this dataset, we should expect it to be a very small one because we are averaging it with a bunch of cases that by definition could not possibly show the effect. In light of that, a larger effect should make us more skeptical, not more persuaded.
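The dilution logic can be made concrete with a little arithmetic. In this sketch the shares of women who had kids as assistant professors are hypothetical, since the authors do not have that information, and the function name is mine.

```python
def implied_effect_among_eligible(average_effect, fraction_eligible):
    """If only a fraction of the sample could be affected by the policy,
    the sample-average effect is fraction_eligible * (effect among the eligible).
    Solve that for the effect among the eligible."""
    return average_effect / fraction_eligible

average_effect = -22  # reported change for women, in percentage points

# Hypothetical shares of women who had kids before tenure.
for fraction in (1.0, 0.5, 0.25):
    effect = implied_effect_among_eligible(average_effect, fraction)
    print(f"if {fraction:.0%} were eligible, implied effect among them: {effect:.0f} points")
```

A drop of 44 or 88 points would require those women to have started from a tenure rate far above the roughly 30% reported for the sample as a whole, which is one way to see why the size of the reported effect is a reason for suspicion.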

There is also the odd finding that in departments that offered tenure extension policies, men took less time to get to tenure (about 1 year less on average). This is the opposite of what you’d expect if “men who took parental leave used the extra year to publish their research” as the NYT writeup claims. The original study authors offer a complicated, speculative story about why time-to-tenure would not be expected to change in the obvious way. If you accept the story, it requires invoking a bunch of mechanisms that are not measured in the paper and likely would add more noise and interpretive ambiguity to the estimates of interest.

There were still other analytic decisions that I had trouble understanding. For example, the authors excluded people who had 0 or 1 publication in their first 2 years. Isn’t that exactly the kind of variance that should go into the didn’t-get-tenure side of the analyses? And the analyses include a whole bunch of covariates without a lot of discussion (and no pre-registration to limit researcher degrees of freedom). One of the covariates had a strange effect: holding a degree from a top-10 PhD-granting institution makes it less likely that you will get tenure in your first job. This does make sense if you think that top-10 graduates are likely to get killer outside offers – but then that just reinforces the lack of clarity about what the tenure-in-first-job variable is really an indicator of.

But when all is said and done, probably the most important part of the paper is two sentences right on the title page:

IZA Discussion Papers often represent preliminary work and are circulated to encourage discussion. Citation of such a paper should account for its provisional character.

The NYT writeup does no such thing; in fact it goes the opposite direction, trying to draw broad generalizations and make policy recommendations. This is no slight against the study’s original authors – it is typical in economics to circulate working papers for discussion and critique. Maybe they’d have compelling responses to everything I said, who knows? But at this stage, I have a hard time seeing how this working paper is ready for a popular media writeup for general consumption.

The biggest worry I have is that university administrators might take these results and run with them. I do agree with the fundamental motivation for doing this study, which is that policies need to be evaluated by their effects. Sometimes superficially gender-neutral policies have disparate impacts when they run into the realities of biological and social roles (“primary caregiver” leave policies being a case in point). It’s fairly obvious that in many ways the academic workplace is not structured to support involved parenthood, especially motherhood. But are tenure extension policies making the problem worse, better, or are they indifferent? For all the reasons outlined above, I don’t think this research gives us an actionable answer. Policy should come from cumulative knowledge, not noisy and ambiguous preliminary findings.

In the meantime, what administrator would not love to be able to put on the appearance of We Are Doing Something by rolling back a benefit? It would be a lot cheaper and easier than fixing disparities in the hiring process, providing subsidized child care, or offering true paid leave. I hope this piece does not license them to do that.


Thanks to Ryan Light and Rich Lucas, who dug into the paper and first raised some of the issues I discuss here in response to my initial perplexed tweet.

Evaluating a new critique of the Reproducibility Project

Over the last five years psychologists have been paying more and more attention to issues that could be diminishing the quality of our published research — things like low power, p-hacking, and publication bias. We know these things can affect reproducibility, but it can be hard to gauge their practical impact. The Reproducibility Project: Psychology (RPP), published last year in Science, was a massive, coordinated effort to produce an estimate of where several of the field’s top journals stood in 2008 before all the attention and concerted improvement began.

The RPP is not perfect, and the paper is refreshingly frank about its limitations and nuanced about its conclusions. But all science proceeds on fallible evidence (there isn’t any other kind), and it has been welcomed by many psychologists as an informative examination of the reliability of our published findings.

Welcomed by many, but not welcomed by all.

In a technical commentary released today in Science, Dan Gilbert, Gary King, Stephen Pettigrew, and Tim Wilson take exception to the conclusions that the RPP authors and many scientists who read it have reached. They offer re-analyses of the RPP, some incorporating outside data. They maintain that the RPP authors’ conclusions are wrong, and on re-examination the data tell us that “the reproducibility of psychological science is quite high.” (The RPP authors published a reply.)

What should we make of it? I read the technical comment, the supplement (as you’ll see there were some surprises in it), the Open Science Collaboration’s reply, Gilbert et al.’s unpublished response to the reply, and I re-read the Many Labs report that plays a critical role in the commentary. Here are my thoughts.

Unpacking the critics’ replication metric

To start with, a key to understanding Gilbert et al.’s critique is understanding the metric of replicability that it uses.

There are many reasons why original and replication studies might get different results. Some are neutral and unavoidable, like sampling error. Some are signs of good things, like scientists pushing into the unknown. But some are problems. That can include a variety of errors and biases in original studies, errors and biases in replications, and systemic problems like publication bias. Just like there are many reasons for originals and replications to get different results, there are many ways to index those differences. Different metrics are sensitive to different things about original and replication studies. Whereas the RPP looked at a number of different metrics, the critique focuses on one: whether the point estimate of the replication effect size falls within the confidence interval of the original.

But in justifying this choice, the critique’s authors misstate what confidence intervals are. They write that 95% of replications should fall within the original studies’ confidence intervals. That just isn’t true – a P% confidence interval does not predict P% success in future replications. To be fair, almost everyone misinterprets confidence intervals. But when they are pivotal to your sole metric of reproducibility and your interpretation hinges on them, it would be good to get them right.
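To see why, consider a sketch under idealized assumptions: the original and the replication estimate the same true effect, both estimates are normally distributed, and there is no bias of any kind. The function name and the sample-size ratios are mine, chosen for illustration.

```python
import math

def capture_probability(n_ratio, z=1.96):
    """Probability that a replication's point estimate lands inside the
    original study's 95% CI when both studies estimate the same true effect.

    n_ratio = replication sample size / original sample size.
    The difference between the two estimates has variance
    se_original^2 * (1 + 1 / n_ratio); the CI half-width is z * se_original.
    """
    threshold = z / math.sqrt(1 + 1 / n_ratio)
    # P(|Z| < threshold) for a standard normal Z
    return math.erf(threshold / math.sqrt(2))

for ratio in (0.5, 1, 4):
    print(f"replication n = {ratio} x original n: {capture_probability(ratio):.1%}")
```

With equal sample sizes the figure is about 83%, not 95%, and it only approaches 95% as the replication becomes very large relative to the original.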

Another issue that is critical to interpreting intervals is knowing that intervals get wider the less data you have. This is never addressed, but the way Gilbert et al. use original studies’ confidence intervals to gauge replicability means that the lower an original study’s power, the easier it will be to “successfully” replicate it. Conversely, a very high-powered original study can “fail to replicate” because of trivial heterogeneity in the effect. Not all replication metrics are vulnerable to this problem. But if you are going to use a replication metric that is sensitive to power in this way, you need to present it alongside other information that puts it in context. Otherwise you can be led seriously astray.
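Here is a sketch of that sensitivity. It assumes each study’s true effect is drawn independently from a normal distribution with standard deviation tau (a small amount of heterogeneity), and that the estimates are normal around those true effects. All the numbers are made up, and the function name is hypothetical.

```python
import math

def capture_with_heterogeneity(se_original, se_replication, tau, z=1.96):
    """P(replication estimate falls inside the original 95% CI) when the two
    studies' true effects are independent draws with standard deviation tau."""
    half_width = z * se_original
    sd_difference = math.sqrt(se_original**2 + se_replication**2 + 2 * tau**2)
    return math.erf(half_width / sd_difference / math.sqrt(2))

# Hypothetical values: a precise replication and a little heterogeneity,
# with originals ranging from very noisy to very precise.
for se_original in (0.30, 0.10, 0.02):
    p = capture_with_heterogeneity(se_original, se_replication=0.05, tau=0.05)
    print(f"original SE = {se_original:.2f}: capture probability = {p:.0%}")
```

In this made-up scenario the noisiest original is “replicated” most of the time and the most precise original fails most of the time, even though nothing about the underlying effects changes across the three cases.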

A limited scope with surprising omissions

The RPP is descriptive, observational data about replications. Gilbert et al. try to model the underlying causes. If there are many reasons why original and replication studies can differ, it would make sense to try to model as many of them as possible, or at least the most important ones. Unfortunately, the critique takes a quite narrow, confirmatory approach to modeling differences between original and replication studies. Of all the possible reasons why original and replication studies can differ, it only looks for random error and flaws in replication studies.

This leads to some striking omissions. For example, any scientist can tell you that publication bias is ubiquitous. It creates biases in the results of original published studies, which can make it harder to reproduce their (biased) effects. But it would not have affected the replications in the RPP. Nor would it affect the comparisons among Many Labs replications that Gilbert et al. use as benchmarks (more on that in a moment). Yet the commentary’s re-analyses of replicability make no attempt to detect or account for publication bias anywhere.

If you want to know how much something varies, calculate its variance

Gilbert et al. propose that replications might have variable effects because of differences in study populations or procedures. This is certainly an important issue, and one that has been raised before in interpreting replications.

In order to offer new insight on this issue, Gilbert et al. re-analyze data from Klein et al.’s (2014) Many Labs 1 study to see how often pairs of studies trying to get the same effect had a “successful” replication by the original-study-confidence-interval criterion. Unfortunately, that analysis mixes together power and effect size heterogeneity – they are very different things, and both higher power of original studies and effect size heterogeneity will lower replication success in this kind of analysis. It does not provide a clean estimate of effect variability.

There is a more straightforward way to know if effects varied across Many Labs replication sites: calculate the variance in the effects. Klein et al. report this in their Table 3. The data show that big effects tended to vary across sites but more modest ones did not. And by big I mean big – there are 5 effects in Many Labs 1 with a Cohen’s d greater than 1.0. Four of them are variations on the anchoring effect. Effect sizes that big are quite unusual in social psychology – they were probably included by Klein et al. to make sure there were some slam-dunk effects in the Many Labs project, not because they are representative. But effects that are more typical in size are not particularly variable in Many Labs 1. Nor is there much variance in any of the effects examined in the similar Many Labs 3.
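For readers who have not computed this kind of thing before, here is a minimal sketch of one standard approach: Cochran’s Q and the DerSimonian-Laird estimate of between-site variance. I am not claiming this is exactly the computation behind Klein et al.’s table, and the site-level numbers below are invented for illustration.

```python
def heterogeneity(effects, variances):
    """Cochran's Q and the DerSimonian-Laird estimate of between-site
    variance (tau^2), from per-site effect sizes and sampling variances."""
    weights = [1 / v for v in variances]
    w_sum = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, effects)) / w_sum
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    c = w_sum - sum(w ** 2 for w in weights) / w_sum
    tau2 = max(0.0, (q - df) / c)  # truncate at zero, as is conventional
    return q, tau2

# Hypothetical site-level effects (Cohen's d), each with sampling variance 0.01.
modest_effect_sites = [0.18, 0.22, 0.20, 0.25]
large_effect_sites = [0.8, 1.2, 1.6, 2.0]

for label, sites in (("modest", modest_effect_sites), ("large", large_effect_sites)):
    q, tau2 = heterogeneity(sites, [0.01] * len(sites))
    print(f"{label} effect: Q = {q:.2f}, tau^2 = {tau2:.3f}")
```

Unlike the confidence-interval criterion, this separates how much the effect varies across sites from how precisely any one study measured it.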

Apples-to-oranges comparisons of replicability from RPP to Many Labs

Another argument Gilbert et al. make is that with enough power, most RPP replications would have been successful. To support this argument they look again at Many Labs to see how often the combined sample of 6000+ participants could replicate the original studies. Here is how they describe it:

OSC attempted to replicate each of 100 studies just once, and that attempt produced an unsettling result: Only 47% of the original studies were successfully replicated (i.e., produced effects that fell within the confidence interval of the original study). In contrast, MLP [Many Labs] attempted to replicate each of its studies 35 or 36 times and then pooled the data. MLP’s much more powerful method produced a much more heartening result: A full 85% of the original studies were successfully replicated. What would have happened to MLP’s heartening result if they had used OSC’s method? Of MLP’s 574 replication studies, only 195 produced effects that fell within the confidence interval of the original, published study. In other words, if MLP had used OSC’s method, they would have reported an unsettling replication rate of 34% rather than the heartening 85% they actually reported.

Three key numbers stand out in this paragraph. The RPP replication rate was 47%. The high-powered (N>6000) Many Labs pooled-sample replication rate was 85%. But if the RPP approach is applied to Many Labs (i.e. looking at single samples instead of the pooled rate), the rate drops to 34%. On its face, that sounds like a problem for the RPP.

Except when I actually looked at Table 2 of Many Labs and tried to verify the 85% number for the pooled sample, I couldn’t. There are 15 original studies where a confidence interval could be calculated. Only 6 of the pooled replication effects landed inside the intervals. So the correct number is 40%. Where did 85% come from? Although it’s virtually impossible to tell in the paragraph I quoted above, I found buried in the supplement the key detail that Gilbert et al. got their “heartening” 85% from a totally different replication metric: the tally of replications that got p < .05 (if you treat the anchoring effects as one, there are 11 significant effects out of 13). Instead of making an apples-to-apples comparison, they switch to a different metric exactly once in their critique, on only one side of this key comparison.
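The two tallies are easy to check side by side. This sketch uses only the counts given above; the variable names are mine.

```python
# Confidence-interval metric: pooled Many Labs effects that landed inside
# the original study's interval, out of originals where a CI could be computed.
ci_metric = 6 / 15

# Significance metric: pooled replications with p < .05,
# counting the four anchoring effects as one.
significance_metric = 11 / 13

print(f"confidence-interval metric: {ci_metric:.0%}")
print(f"significance metric: {significance_metric:.0%}")
```

Same dataset, same pooled samples, and the answer goes from 40% to 85% depending only on which metric you pick.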

What if instead you calculate the replicability rate using the same metric for both sides of the comparison? Using the confidence interval metric that Gilbert et al. use everywhere else, you get 47% in the RPP versus 40% in the pooled analysis of Many Labs. So the RPP actually did better than Many Labs with its N > 6000 sample sizes. How could that be?

It turns out that the confidence interval metric can lead you to some surprising conclusions. Because larger effects were more variable in Many Labs 1, the effects that did the worst job “replicating” by Gilbert et al’s original-study-confidence-interval criterion are the biggest ones. Thus anchoring – yes, anchoring – “failed to replicate” three out of four times. Gain vs. loss framing failed too. (Take that, Kahneman and Tversky!) By contrast, flag priming would appear to have replicated successfully – even though the original authors themselves have said that Many Labs did not successfully replicate it.

In addition to completely undermining the critique’s conclusion about power, all of this goes back to my earlier point that the confidence-interval metric needs to be interpreted with great caution. In their reply, the RPP authors bring up differences among replication metrics. In an unpublished response, Gilbert et al. write: “This is a red herring. Neither we nor the authors of OSC2015 found any substantive differences in the conclusions drawn from the confidence interval measure versus the other measures.” I don’t know what to make of that. How can they think 85% versus 40% is not a substantive difference?

Flaws in a fidelity metric

Another issue raised by the critique is what its authors call the “fidelity” of the replications: how well the replication protocols got the original studies’ methods right. As with variability in populations and procedures, this is an important issue that merits a careful look in any replication study.

The technical comment gives a few examples of differences between original and replication protocols that sound like they could have mattered in some cases. How did these issues play out in the RPP as a whole? Unfortunately, the critique uses a flawed metric to quantify the effects of fidelity: the original authors’ endorsement of the replication protocol.

There are two problems with their approach. First, original study authors have expertise in the methods, of course. But they also have inside knowledge about flaws in their original studies. The critique acknowledges this problem but makes no attempt to account for it in the analyses.

Second, Gilbert et al. compared “endorsements” to “nonendorsements,” but a majority of the so-called nonendorsements were cases where original authors simply did not respond – an important detail that is again only found in the supplement. Original authors only registered concerns in 11 out of 100 replications, versus 18 nonresponses. Like with any missing-data problem, we do not know what the nonresponders would have said if they had responded. But the analysis assumes that none of the 18 would have endorsed the replication protocols.

A cleaner fidelity metric would have helped. But ultimately, these kinds of indirect analyses can only go so far. Gilbert et al. claim that original studies would replicate just fine if only replicators would get the procedures right. This is an empirical question with a very direct way of getting an answer: go run a replication the way you think it ought to be done. I suspect that some of the studies probably would successfully replicate, either because of Type II error or substantive differences. We could learn a tremendous amount from direct empirical tests of hypotheses about replication fidelity and other hidden moderators, far more than we can from these kinds of indirect analyses with weak proxies.

We can move the conversation forward

In the last 5 years there have been a lot of changes in psychology. We now know that there are problems with how we have sometimes done research in the past. For example, it was long considered okay to analyze small, noisy datasets with a lot of flexibility to look around for patterns that supported a publishable conclusion. There is a lot more awareness now that these practices will lead to lower reproducibility, and the field is starting to do something about that. The RPP came around after we already knew that. But it added meaningfully to that discussion by giving us an estimate of reproducibility in several top journals. It gave us a sense, however rough, of where the field stood in 2008 before we started making changes.

That does not mean psychologists are all of one mind about where psychology is at on reproducibility and what we ought to do about it. There has been a lot of really fruitful discussion recently coming from different perspectives. Some of the critical commentaries raise good concerns and have a lot of things I agree with.

The RPP was a big and complicated project, and given its impact it warrants serious critical analysis from multiple perspectives. I agree with Uri Simonsohn that some of the protocol differences between originals and replications deserve closer scrutiny, and it is good that Gilbert et al. brought them to our attention. I found myself less enthusiastic about their analyses, for the reasons I have outlined here.

But the discussion will continue to move forward. The RPP dataset is still open, and I know there are other efforts under way to draw new insights from it. Even better, there is lots of other, new meta-science happening too. I remain optimistic that as we continue to learn more, we will keep making things better in our field.

* * * * *

UPDATE (3/8/2016): There has been a lot of discussion about the Gilbert et al. technical comment since I put up this post. Gilbert et al. have written a reply that responds to some of the issues that I and others have raised.

Here are some other relevant discussions in the academic blogosphere:

Reading “The Baby Factory” in context

[Image: cherry orchard. Photo credit: Des Blenkinsopp.]

Yesterday I put up a post about David Peterson’s The Baby Factory, an ethnography of 3 baby labs that discusses his experience as a participant observer. My post was mostly excerpts, with a short introduction at the beginning and a little discussion at the end. That was mostly to encourage people to go read it. (It’s open-access!)

Today I’d like to say a little more.

How you approach the article probably depends a lot on what background and context you come to it with. It would be a mistake to look to an ethnography for a generalizable estimate of something about a population, in this case about how common various problematic practices are. That’s not what ethnography is for. But at this point in history, we are not lacking for information about the ways we need to improve psychological science. There have been surveys and theoretical analyses and statistical analyses and single-lab replications and coordinated many-lab replications and all the rest. It’s getting harder and harder to claim that the evidence is cherry-picked without seriously considering the possibility that you’re in the middle of a cherry orchard. As Simine put it so well:

even if you look at your own practices and those of everyone you know, and you don’t see much p-hacking going on, the evidence is becoming overwhelming that p-hacking is happening a lot. my guess is that the reason people can’t reconcile that with the practices they see happening in their labs and their friends’ labs is that we’re not very good at recognizing p-hacking when it’s happening, much less after the fact. we can’t rely on our intuitions about p-hacking. we have to face the facts. and, in my view, the facts are starting to look pretty damning.

You don’t even have to go as far as Simine or me. You just have to come into reading the ethnography with a realistic belief that problematic practices are at least at a high enough rate to be worrisome. And then the ethnography does what ethnographies do, and well in my judgment: it illustrates what these things look like, out there in the world, when they are happening.

In particular, I think a valuable part of Peterson’s ethnography is that it shows how problematic practices don’t just have to happen furtively by one person with the door closed. Instead, they can work their way into the fabric of how members of a lab talk and interact. When Leslie John et al. introduced the term questionable research practices, they defined it as “exploitation of the gray area of acceptable practice.” The Baby Factory gives us a view into how that can be a social process. Gray zones are by definition ambiguous; should we be shocked to find out that people working closely together will come to a socially shared understanding of them?

Another thing Peterson’s ethnography does is talk about the larger context where all this is happening, and try to interpret his observations in that context. He writes about the pressures for creating a public narrative of science that looks sharp and clean, about the need to make the most of very limited resources and opportunities, and about the very real challenges of working with babies (the “difficult research objects” of the subtitle). A commenter yesterday thought he came to the project with an axe to grind. But his interpretive framing was very sympathetic to the challenges of doing infant cognition research. And his concluding paragraphs were quite optimistic, suggesting that the practices he observed may be part of a “local culture” that has figured out how they can promote positive scientific development. I wish he’d developed that argument more. I don’t think infant cognition research has lacked for important scientific discoveries — but I would say it is in spite of the compromises researchers have sometimes had to make, not because of them.

I do think it would be a mistake to come away thinking this is something limited to infant cognition research. Peterson grounds his discussion in the specific challenges of studying babies, who have a habit of getting distracted or falling asleep or putting your stimuli in their mouths. Those particular problems may be distinctive to having babies as subjects, and I can understand why that framing might make baby researchers feel especially uncomfortable. But anybody who is asking big questions about the human mind is working with a difficult research object, and we all face the same larger pressures and challenges. There are some great efforts under way to understand the particular challenges of research practice and replicability in infant research, but whatever we learn from that is going to be about how broader problems are manifesting in a specific area. I don’t really see how you can fairly conclude otherwise.

An eye-popping ethnography of three infant cognition labs

I don’t know how else to put it. David Peterson, a sociologist, recently published an ethnographic study of 3 infant cognition labs. Titled “The Baby Factory: Difficult Research Objects, Disciplinary Standards, and the Production of Statistical Significance,” it recounts his time spent as a participant observer in those labs, attending lab meetings and running subjects.

In his own words, Peterson “shows how psychologists produce statistically significant results under challenging circumstances by using strategies that enable them to bridge the distance between an uncontrollable research object and a professional culture that prizes methodological rigor.” The account of how the labs try to “bridge the distance” reveals one problematic practice after another, in a way that sometimes makes them seem like normal practice and no big deal to the people in the labs. Here are a few examples.

Protocol violations that break blinding and independence:

…As a routine part of the experiments, parents are asked to close their eyes to prevent any unconscious influence on their children. Although this was explicitly stated in the instructions given to parents, during the actual experiment, it was often overlooked; the parents’ eyes would remain open. Moreover, on several occasions, experimenters downplayed the importance of having one’s eyes closed. One psychologist told a mother, “During the trial, we ask you to close your eyes. That’s just for the journals so we can say you weren’t directing her attention. But you can peek if you want to. It’s not a big deal. But there’s not much to see.”

Optional stopping based on data peeking:

Rather than waiting for the results from a set number of infants, experimenters began “eyeballing” the data as soon as babies were run and often began looking for statistical significance after just 5 or 10 subjects. During lab meetings and one-on-one discussions, experiments that were “in progress” and still collecting data were evaluated on the basis of these early results. When the preliminary data looked good, the test continued. When they showed ambiguous but significant results, the test usually continued. But when, after just a few subjects, no significance was found, the original protocol was abandoned and new variations were developed.
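For readers who want to see why peeking matters, here is a minimal simulation sketch. It is a generic optional-stopping rule, not a model of what any particular lab did: I assume a true effect of exactly zero, a one-sample z-test with known SD of 1, a look after every 5 subjects up to 30, and stopping as soon as any look crosses p < .05. All of those settings are my own illustrative choices.

```python
import math
import random

def z_significant(values, crit=1.96):
    """Two-sided one-sample z-test against 0, assuming a known SD of 1."""
    z = sum(values) / math.sqrt(len(values))
    return abs(z) > crit

def simulate(n_sims=5000, looks=(5, 10, 15, 20, 25, 30), seed=1):
    """Return (fixed-n false-positive rate, peeking false-positive rate)."""
    rng = random.Random(seed)
    fixed_hits = 0
    peeking_hits = 0
    for _ in range(n_sims):
        # The null is true: every observation has mean zero.
        data = [rng.gauss(0.0, 1.0) for _ in range(looks[-1])]
        # Fixed design: test once, at the planned final sample size.
        fixed_hits += z_significant(data)
        # Peeking design: "significant" if ANY interim look crosses the line.
        peeking_hits += any(z_significant(data[:n]) for n in looks)
    return fixed_hits / n_sims, peeking_hits / n_sims

fixed_rate, peeking_rate = simulate()
print(f"false-positive rate, fixed n = 30:   {fixed_rate:.3f}")
print(f"false-positive rate, peek every 5:   {peeking_rate:.3f}")
```

The fixed design should land near the nominal 5%, while the peeking design comes out well above it, even though nothing is going on in the simulated data.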

Invalid comparisons of significant to nonsignificant:

Because experiments on infant subjects are very costly in terms of both time and money, throwing away data is highly undesirable. Instead, when faced with a struggling experiment using a trusted experimental paradigm, experimenters would regularly run another study that had higher odds of success. This was accomplished by varying one aspect of the experiment, such as the age of the participants. For instance, when one experiment with 14-month-olds failed, the experimenter reran the same study with 18-month-olds, which then succeeded. Once a significant result was achieved, the failures were no longer valueless. They now represented a part of a larger story: “Eighteen-month-olds can achieve behavior X, but 14-month-olds cannot.” Thus, the failed experiment becomes a boundary for the phenomenon.
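The problem with this kind of "boundary" story is that one significant result and one nonsignificant result do not by themselves show that the two groups differ. Here is a small worked sketch with made-up numbers (the estimates and standard errors are hypothetical, and I use a normal approximation for the p-values): the older group is significant, the younger group is not, yet the direct test of the difference between them is nowhere near significant.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a z statistic under the standard normal."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical effect estimates and standard errors, for illustration only.
older_est, older_se = 0.50, 0.20      # e.g., the 18-month-olds
younger_est, younger_se = 0.20, 0.20  # e.g., the 14-month-olds

p_older = two_sided_p(older_est / older_se)
p_younger = two_sided_p(younger_est / younger_se)

# The claim "older babies can, younger babies cannot" is a claim about the
# DIFFERENCE, so that is what has to be tested.
diff = older_est - younger_est
diff_se = math.sqrt(older_se ** 2 + younger_se ** 2)
p_diff = two_sided_p(diff / diff_se)

print(f"older group:   p = {p_older:.3f}")
print(f"younger group: p = {p_younger:.3f}")
print(f"difference:    p = {p_diff:.3f}")
```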

And HARKing:

When a clear and interesting story could be told about significant findings, the original motivation was often abandoned. I attended a meeting between a graduate student and her mentor at which they were trying to decipher some results the student had just received. Their meaning was not at all clear, and the graduate student complained that she was having trouble remembering the motivation for the study in the first place. Her mentor responded, “You don’t have to reconstruct your logic. You have the results now. If you can come up with an interpretation that works, that will motivate the hypothesis.”

A blunt explanation of this strategy was given to me by an advanced graduate student: “You want to know how it works? We have a bunch of half-baked ideas. We run a bunch of experiments. Whatever data we get, we pretend that’s what we were looking for.” Rather than stay with the original, motivating hypothesis, researchers in developmental science learn to adjust to statistical significance. They then “fill out” the rest of the paper around this necessary core of psychological research.

Peterson discusses all this in light of recent discussions about replicability and scientific practices in psychology. He says that researchers have basically 3 choices: limit the scope of your questions to what you can do well with available methods, relax our expectations of what a rigorous study looks like, or engage in QRPs. I think that is basically right. It is why I believe that any attempt to reduce QRPs has to be accompanied by changes to incentive structures, which govern the first two.

Peterson also suggests that QRPs are “becoming increasingly unacceptable.” That may be true in public discourse, but the inside view presented by his ethnography suggests that unless more open practices become standard, labs will continue to have lots of opportunity to engage in them and little incentive not to.

UPDATE: I discuss what all this means in a followup post: Reading “The Baby Factory” in context.

Three ways to approach the replicability discussion

There are 3 ways to approach the replicability discussion/debate in science.

#1 is as a logic problem. There are correct answers, and the challenge is to work them out. The goal is to be right.

#2 is as a culture war. There are different sides with different motives, values, or ideologies. Some are better than others. So the goal is to win out over the other side.

#3 is as a social movement. Scientific progress is a shared value. Recently accumulated knowledge and technology have given us better ways to achieve it, but institutions and practices are slow to change. So the goal is to get everybody on board to make things better.

Probably all of us have elements of all three in us (and to be clear, all of us are trying to use reasoning to solve problems — the point of #1 is that’s the end goal so you stop there). But you see noticeable differences in which predominates in people’s public behavior.

My friends can probably guess which approach I feel most aligned with.

Bold changes at Psychological Science

Style manuals sound like they ought to be boring things, full of arcane points about commas and whatnot. But Wikipedia’s style manual has an interesting admonition: Be bold. The idea is that if you see something that could be improved, you should dive in and start making it better. Don’t wait until you are ready to be comprehensive, don’t fret about getting every detail perfect. That’s the path to paralysis. Wikipedia is an ongoing work in progress; your changes won’t be the last word, but you can make things better.

In a new editorial at Psychological Science, interim editor Stephen Lindsay is clearly following the be bold philosophy. He lays out a clear and progressive set of principles for evaluating research. Beware the “troubling trio” of low power, surprising results, and just-barely-significant results. Look for signs of p-hacking. Care about power and precision. Don’t confuse nonsignificant for null.

To people who have been paying attention to the science reform discussion of the last few years (and its longstanding precursors), none of this is new. What is new is that an editor of a prominent journal has clearly been reading and absorbing the last few years’ wave of careful and thoughtful scholarship on research methods and meta-science. And he is boldly acting on it.

I mean, yes, there are some things I am not 100% in love with in that editorial. Personally, I’d like to see more value placed on good exploratory research.* I’d like to see him discuss whether Psychological Science will be less results-oriented, since that is a major contributor to publication bias.** And I’m sure other people have their objections too.***

But… Improving science will forever be a work in progress. Lindsay has laid out a set of principles. In the short term, they will be interpreted and implemented by humans with intelligence and judgment. In the longer term, someone will eventually look at what is and is not working and will make more changes.

Are Lindsay’s changes as good as they could possibly be? The answers are (1) “duh” because obviously no and (2) “duh” because it’s the wrong question. Instead let’s ask, are these changes better than things have been? I’m not going to give that one a “duh,” but I’ll stand behind a considered “yes.”

———-

* Part of this is because in psychology we don’t have nearly as good a foundation of implicit knowledge and accumulated wisdom for differentiating good from bad exploratory research as we do for hypothesis-testing. So exploratory research gets a bad name because somebody hacks around in a tiny dataset and calls it “exploratory research,” and nobody has the language or concepts to say why they’re doing it wrong. I hope we can fix that. For starters, we could start stealing more ideas from the machine learning and genomics people, though we will need to adapt them for the particular features of our scientific problems. But that’s a blog post for another day.

** There are some nice comments about this already on the ISCON facebook page. Dan Simons brought up the exploratory issue; Victoria Savalei the issue about results-focus. My reactions on these issues are in part bouncing off of theirs.

*** When I got to the part about using confidence intervals to support the null, I immediately had a vision of steam coming out of some of the Twitter Bayesians’ ears.

Moderator interpretations of the Reproducibility Project

The Reproducibility Project: Psychology (RPP) was published in Science last week. There has been some excellent coverage and discussion since then. If you haven’t heard about it,* Ed Yong’s Atlantic coverage will catch you up. And one of my favorite commentaries so far is on Michael Frank’s blog, with several very smart and sensible ways the field can proceed next.

Rather than offering a broad commentary, in this post I’d like to discuss one possible interpretation of the results of the RPP, which is “hidden moderators.” Hidden moderators are unmeasured differences between original and replication experiments that would result in differences in the true, underlying effects and therefore in the observed results of replications. Things like differences in subject populations and experimental settings. Moderator interpretations were the subject of a lengthy discussion on the ISCON Facebook page recently, and are the focus of an op-ed by Lisa Feldman Barrett.

In the post below, I evaluate the hidden-moderator interpretation. The tl;dr version is this: Context moderators are probably common in the world at large and across independently-conceived experiments. But an explicit design goal of direct replication is to eliminate them, and there’s good reason to believe they are rare in replications.

1. Context moderators are probably not common in direct replications

Many social and personality psychologists believe that lots of important effects vary by context out in the world at large. I am one of those people — subject and setting moderators are an important part of what I study in my own work. William McGuire discussed the idea quite eloquently, and it can be captured in an almost koan-like quote from Niels Bohr: “The opposite of one profound truth may very well be another profound truth.” It is very often the case that support for a broad hypothesis will vary and even be reversed over traditional “moderators” like subjects and settings, as well as across different methods and even different theoretical interpretations.

Consider as a thought experiment** what would happen if you told 30 psychologists to go test the deceptively simple-seeming hypothesis “Happiness causes smiling” and turned them loose. You would end up with 30 different experiments that would differ in all kinds of ways that we can be sure would matter: subjects (with different cultural norms of expressiveness), settings (e.g., subjects run alone vs. in groups), manipulations and measures of the IV (film clips, IAPS pictures, frontal asymmetry, PANAS?) and DV (FACS, EMG, subjective ratings?), and even construct definitions (state or trait happiness? eudaimonic or hedonic? Duchenne or social smiles?). You could learn a lot by studying all the differences between the experiments and their results.

But that stands in stark contrast to how direct replications are carried out, including the RPP. Replicators aren’t just turned loose with a broad hypothesis. In direct replication, the goal is to test the hypothesis “If I do the same experiment, I will get the same result.” Sometimes a moderator-ish hypothesis is built in (“this study was originally done with college students, will I get the same effect on Mturk?”). But such differences from the original are planned in. The explicit goal of replication design is for any other differences to be controlled out. Well-designed replication research makes a concerted effort to faithfully repeat the original experiments in every way that documentation, expertise, and common sense say should matter (and often in consultation with original authors too). The point is to squeeze out any room for substantive differences.

Does it work? In a word, yes. We now have data telling us that the squeezing can be very effective. In Many Labs 1 and Many Labs 3 (which I reviewed here), different labs followed standardized replication protocols for a series of experiments. In principle, different experimenters, different lab settings, and different subject populations could have led to differences between lab sites. But in analyses of heterogeneity across sites, that was not the result. In ML1, some of the very large and obvious effects (like anchoring) varied a bit in just how large they were (from “kinda big” to “holy shit”). Across both projects, more modest effects were quite consistent. Nowhere was there evidence that interesting effects wink in and out of detectability for substantive reasons linked to sample or setting.

We will continue to learn more as our field gets more experience with direct replication. But right now, a reasonable conclusion from the good, systematic evidence we have available is this: When some researchers write down a good protocol and other researchers follow it, the results tend to be consistent. In the bigger picture this is a good result for social psychology: it is empirical evidence that good scientific control is within our reach, neither beyond our experimental skills nor intractable for the phenomena we study.***

But it also means that when replicators try to clamp down potential moderators, it is reasonable to think that they usually do a good job. Remember, the Many Labs labs weren’t just replicating the original experiments (from which their results sometimes differed – more on that in a moment). They were very successfully and consistently replicating each other. There could be individual exceptions here and there, but on the whole our field’s experience with direct replication so far tells us that it should be unusual for unanticipated moderators to escape replicators’ diligent efforts at standardization and control.

2. A comparison of a published original and a replication is not a good way to detect moderators

Moderation means there is a substantive difference between 2 or more (true, underlying) effects as a function of the moderator variable. When you design an experiment to test a moderation hypothesis, you have to set things up so you can make a valid comparison. Your observations should ideally be unbiased, or failing that, the biases should be the same at different levels of the moderator so that they cancel out in the comparison.

With the RPP (and most replication efforts), we are trying to interpret observed differences between published original results and replications. The moderator interpretation rests on the assumption that observed differences between experiments are caused by substantive differences between them (subjects, settings, etc.). An alternative explanation is that there are different biases. And that is almost certainly the case. The original experiments are generally noisier because of modest power, and that noise is then passed through a biased filter (publication bias for sure — these studies were all published at selective journals — and perhaps selective reporting in some cases too). By contrast, the replications are mostly higher powered, the analysis plans were pre-registered, and the replicators committed from the outset to publish their findings no matter what the results.

That means that a comparison of published original studies and replication studies in the RPP is a poor way to detect moderators, because you are comparing a noisy and biased observation to one that is much less so.**** And such a comparison would be a poor way to detect moderators even if you were quite confident that moderators were out there somewhere waiting to be found.
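Here is a minimal simulation sketch of that argument. Everything in it is an assumption I picked for illustration (a true standardized effect of 0.2 in every study, 20 per group in originals, 80 per group in replications, a normal approximation with unit variance, and a publication filter that only lets through significant positive originals). None of the numbers come from the RPP. The point is that the true effect is identical in originals and replications, so there is no moderator by construction, and yet the published originals and the replications look very different.

```python
import math
import random
import statistics

TRUE_EFFECT = 0.2    # same true effect in every study: no moderator anywhere
N_ORIGINAL = 20      # per-group n for originals (hypothetical)
N_REPLICATION = 80   # per-group n for replications (hypothetical)
CRIT = 1.96

def observed_effect(rng, n_per_group):
    """Draw an observed effect and its z statistic (normal approximation)."""
    se = math.sqrt(2.0 / n_per_group)
    estimate = rng.gauss(TRUE_EFFECT, se)
    return estimate, estimate / se

rng = random.Random(2015)
published_originals = []
replications = []
replication_significant = 0
for _ in range(20000):
    original, z_original = observed_effect(rng, N_ORIGINAL)
    if z_original <= CRIT:
        continue  # the biased filter: nonsignificant originals go unpublished
    replication, z_replication = observed_effect(rng, N_REPLICATION)
    published_originals.append(original)
    replications.append(replication)  # replications are reported no matter what
    replication_significant += z_replication > CRIT

mean_original = statistics.mean(published_originals)
mean_replication = statistics.mean(replications)
replication_rate = replication_significant / len(replications)

print(f"mean published original effect: {mean_original:.2f}")
print(f"mean replication effect:        {mean_replication:.2f}")
print(f"replications significant:       {replication_rate:.0%}")
```

In this toy setup the published originals overstate the effect several times over and most replications "fail," purely because of noise plus selection. A naive original-versus-replication comparison would read that gap as a moderator.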

3. Moderator explanations of the Reproducibility Project are (now) post hoc

The Reproducibility Project has been conducted with an unprecedented degree of openness. It was started 4 years ago. Both the coordinating plan and the protocols of individual studies were pre-registered. The list of selected studies was open. Original authors were contacted and invited to consult.

What that means is that anyone could have looked at an original study and a replication protocol, applied their expert judgment, and made a genuinely a priori prediction of how the replication results would have differed from the original. Such a prediction could have been put out in the open at any time, or it could have been pre-registered and embargoed so as not to influence the replication researchers.

Until last Friday, that is.

Now the results of the RPP are widely known. And although it is tempting to now look back selectively at “failed” replications and generate substantively interesting reasons, such explanations have to be understood for what they are: untested post hoc speculation. (And if someone now says they expected a failure all along, they’re possibly HARKing too.)

Now, don’t get me wrong — untested post hoc speculation is often what inspires new experiments. So if someone thinks they see an important difference between an original result and a replication and gets an idea for a new study to test it out, more power to them. Get thee to the lab.

But as an interpretation of the data we have in front of us now, we should be clear-eyed in appraising such explanations, especially as an across-the-board factor for the RPP. From a bargain-basement Bayesian perspective, context moderators in well-controlled replications have a low prior probability (#1 above), and comparisons of original and replication studies have limited evidential value because of unequal noise and bias (#2). Put those things together and the clear message is that we should be cautious about concluding that there are hidden moderators lurking everywhere in the RPP. Here and there, there might be compelling, idiosyncratic reasons to think there could be substantive differences to motivate future research. But on the whole, as an explanation for the overall pattern of findings, hidden moderators are not a strong contender.

Instead, we need to face up to the very well-understood and very real differences that we know about between published original studies and replications. The noxious synergy between low power and selective publication is certainly a big part of the story. Fortunately, psychology has already started to make changes since 2008 when the RPP original studies were published. And positive changes keep happening.

Would it be nice to think that everything was fine all along? Of course. And moderator explanations are appealing because they suggest that everything is fine, we’re just discovering limits and boundary conditions like we’ve always been doing.***** But it would be counterproductive if that undermined our will to continue to make needed improvements to our methods and practices. Personally, I don’t think everything in our past is junk, even post-RPP – just that we can do better. Constant self-critique and improvement are an inherent part of science. We have diagnosed the problem and we have a good handle on the solution. All of that makes me feel pretty good.

———

* Seriously though?

** A thought meta-experiment? A gedankengedankenexperiment?

*** I think if you asked most social psychologists, divorced from the present conversation about replicability and hidden moderators, they would already have endorsed this view. But it is nice to have empirical meta-scientific evidence to support it. And to show the “psychology isn’t a science” ninnies.

**** This would be true even if you believed that the replicators were negatively biased and consciously or unconsciously sandbagged their efforts. You’d think the bias was in the other direction, but it would still be unequal and therefore make comparisons of original vs. replication a poor empirical test of moderation. (You’d also be a pretty cynical person, but anyway.)

***** For what it’s worth, I’m not so sure that the hidden moderator interpretation would actually be all that reassuring under the cold light of a rational analysis. This is not the usual assumption that moderators are ubiquitous out in the world. We are talking about moderators that pop up despite concerted efforts to prevent them. Suppose that ubiquitous occult moderators were the sole or primary explanation for the RPP results — so many effects changing when we change so little, going from one WEIRD sample to another, with maybe 5-ish years of potential for secular drift, and using a standardized protocol. That would suggest that we have a poor understanding of what is going on in our labs. It would also suggest that it is extraordinarily hard to study main effects or try to draw even tentative generalizable conclusions about them. And given how hard it is to detect interactions, that would mean that our power problem would be even worse than people think it is now.