What if we talked about p-hacking the way we talk about experimenter effects?

Discussions about p-hacking sometimes go sideways. A hypothetical exchange might go like this:

READER: Those p-values are all hovering just below .05, I bet the authors p-hacked.

AUTHOR: I know that I did not p-hack, and I resent the accusation.

By comparison, consider how we talk about another form of potential bias: experimenter effects.

It is widely accepted that experimenters’ expectations, beliefs, or other characteristics can influence participants in behavioral experiments and medical trials. We also accept that this can happen without intent or even awareness on the part of the experimenter. Expectations about how participants receiving a treatment are supposed to differ from those receiving a placebo might show up in the experimenter’s behavior in subtle ways that could influence the participants.

We also don’t have a complete theory of experimenter effects that allows us to reliably measure every manifestation or predict with high confidence when they will and won’t occur. So instead, we consider them as an open possibility in a wide range of situations. As a result, it is also widely accepted that using procedural safeguards against experimenter effects is a best practice in most experiments where a human experimenter will interact with subjects.

Because of all these shared assumptions, discussions around experimenter effects are often much less heated. If you are presenting a study design at lab meeting, and someone says “you’ll need to keep your RAs blind to condition, here’s an idea how to do that…” that’s generally considered a helpful suggestion rather than an insinuation of planned malfeasance.

And even after a study is done, it is generally considered fair game to ask about blinding and other safeguards, and incorporate their presence or absence into an evaluation of a study. If a study lacks such safeguards, authors generally don’t say things like “I would never stoop so low as to try to influence my participants, how dare you!” Everybody, including authors, understands that experimenters don’t always know how they might be influencing subjects. And when safeguards are missing, readers typically treat it as a reason for doubt and uncertainty. We allow and even expect readers to calibrate that uncertainty based on other assumptions or information, such as how plausible the effect seems and how strong or weak any partial or incomplete safeguards were.

For some reason, though, when it comes to potential bias in data analysis, we have not (yet) reached a place where we can talk about it in a similar way. This is despite the fact that analytic bias has a lot in common with experimenter effects.

It is certainly possible for somebody to deliberately and strategically p-hack, just like it’s possible for an experimenter to wink and nudge and say “are you sure you’re not feeling better?” or whatever. But bias in data analysis does not have to happen that way. Analysts do not have to have intention or even awareness in order to do things that capitalize on chance.

Consider, first of all, that almost every data analysis involves many decisions: what data to include or exclude, whether or how to transform it, a zillion possibilities in specifying the analysis (what particular variables to look at, what analyses to run on them, whether to use one- or two-tailed tests, what covariates to include, which main, interactive, simple, or contrast effect[s] to treat as critical tests of the hypothesis, etc.), and then decisions about what to report. We psychologists of all people know that you cannot un-know something. So once the analyst has seen anything about the data – distributions, scatterplots, preliminary or interim analyses, whatever else – all the subsequent decisions will be made by a person who has that knowledge. And after that point, it is simply impossible for anybody – including the analyst – to state with any confidence how those decisions might otherwise have been made without that knowledge. Which means that we have to treat seriously the possibility that the analyst made decisions that overfit the analyses to the data.

More subtly, as Gelman and Loken discuss in their “forking paths” paper, bias is not defined by a behavior (how many analyses did you run?), but by a set of counterfactuals (how many analyses could you have run?). So even if the objective history is that one and only one analysis was run, that is not a guarantee of no bias.
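To make the “capitalizing on chance” point concrete, here is a toy simulation (the specific analysis options – an outlier rule, a demographic subgroup – are invented for illustration, not taken from any real study). It asks what happens when a handful of defensible-looking specifications are all available and only one of them needs to come out significant:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n_per_cell = 5_000, 30
hits = 0

for _ in range(n_sims):
    # Two conditions with NO true difference, plus an arbitrary binary moderator
    control = rng.normal(0.0, 1.0, n_per_cell)
    treatment = rng.normal(0.0, 1.0, n_per_cell)
    mod_c = rng.integers(0, 2, n_per_cell)   # e.g., a demographic split
    mod_t = rng.integers(0, 2, n_per_cell)

    p_values = [
        stats.ttest_ind(control, treatment).pvalue,              # the basic test
        stats.ttest_ind(control[np.abs(control) < 2],            # "outliers" excluded
                        treatment[np.abs(treatment) < 2]).pvalue,
        stats.ttest_ind(control[mod_c == 0],                     # subgroup analyses
                        treatment[mod_t == 0]).pvalue,
        stats.ttest_ind(control[mod_c == 1],
                        treatment[mod_t == 1]).pvalue,
    ]
    # Only one path through the garden needs to reach p < .05 to yield a "finding"
    if min(p_values) < .05:
        hits += 1

print(f"False-positive rate with flexible analysis: {hits / n_sims:.2f}")
# Well above the nominal .05, even though only one analysis would ever be reported.
```

Nothing in that simulation requires intent; it only requires that the options exist and that the reported one was not fixed in advance.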

What all of this means is that when it comes to bias in data analysis, we are in much the same situation as with experimenter effects. It is virtually impossible to measure or observe it happening in a single instance, even for the person doing the data analysis. But what we can do is define a broad set of circumstances where we have to take it seriously as a possibility.

It would be great if we could collectively shift our conversations around this issue. I think that would involve changes from both critical readers and from authors.

Start by considering procedures, not behavior or outcomes. Were safeguards in place, and if so, how effective were they? For bias in data analysis, the most common safeguard is preregistration. The mere existence of a preregistration (as indicated by a badge or an OSF link in a manuscript) tells you very little though – many of them do not actually constrain bias. Sometimes that is even by design (for example, preregistering an exploratory study is a great way to prevent editors or reviewers from pressuring you to HARK later on). A preregistration is just a transparency step; you have to actually read it to find out what it does.

In order for a preregistration to prevent analytic bias, it has to do two things. First, it has to have a decision inventory – that is, it has to identify all of the decisions about what data to collect/analyze, how to analyze it, and what to report. So ask yourself: is there a section on exclusions? Transformations? Does it say what the critical test is? Etc. (This will be easier to do in domains where you are familiar with the analytic workflow for the research area. It can also be aided by consulting templates. And if authors write and post analysis code as part of a preregistration, that can make things clear too.) Second, the preregistration has to have a plan for all of those decision points. To the extent that the inventory is complete and the plans are specific and were determined separately from the data, the preregistration can be an effective safeguard against bias.
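To give a sense of what posting analysis code with a preregistration can look like, here is a hypothetical sketch. Every specific in it – the column names, the exclusion rules, the critical test – is invented; the point is that the decision inventory and the plan for each decision are written down before the data are seen:

```python
import numpy as np
import pandas as pd
from scipy import stats

# --- Decision inventory, fixed before the data exist -------------------------
MIN_RT_MS = 200            # exclusion rule: drop trials faster than 200 ms
DROP_FAILED_CHECKS = True  # exclusion rule: drop participants who failed the check
CRITICAL_TEST = "treatment vs. control on log RT, two-tailed t-test, alpha = .05"

def preregistered_analysis(df: pd.DataFrame) -> float:
    """Run only the preregistered critical test and return its p-value."""
    # Exclusions: exactly the rules above, nothing data-contingent
    if DROP_FAILED_CHECKS:
        df = df[~df["failed_attention_check"]]
    df = df[df["rt_ms"] >= MIN_RT_MS]

    # Planned transformation
    df = df.assign(log_rt=np.log(df["rt_ms"]))

    # Critical test, exactly as named in CRITICAL_TEST
    treatment = df.loc[df["condition"] == "treatment", "log_rt"]
    control = df.loc[df["condition"] == "control", "log_rt"]
    return stats.ttest_ind(treatment, control).pvalue
```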

When safeguards are missing or incomplete, everyone – authors and readers alike – should treat analytic bias as a serious possibility. If there is no preregistration or other safeguard, then bias is possible. If there is a preregistration but it was vague or incomplete, bias is also possible. In a single instance it is often impossible to know what actually happened, for the reasons I discussed above. It can be reasonable to look at indirect evidence, such as statistical patterns (like the distribution of p-values) or whether the result is a priori implausible. But inferences about these things should be made with calibrated uncertainty: p-curves are neither perfect nor useless; improbable things really do happen, though by definition rarely; and so on. Usually we should not be too sure in any direction.
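For readers who have not seen the logic spelled out, here is a rough sketch of the intuition behind looking at the distribution of significant p-values (the effect size, sample size, and number of simulated studies are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def significant_pvalues(d, n_per_cell=30, n_studies=20_000):
    """Simulate many simple two-cell studies and keep only the significant p-values."""
    a = rng.normal(0.0, 1.0, (n_studies, n_per_cell))
    b = rng.normal(d, 1.0, (n_studies, n_per_cell))
    p = stats.ttest_ind(a, b, axis=1).pvalue
    return p[p < .05]

for label, d in [("true effect, d = 0.5", 0.5), ("null effect, d = 0", 0.0)]:
    p = significant_pvalues(d)
    print(f"{label}: {np.mean(p < .025):.2f} of significant p-values fall below .025")
# A real effect piles its significant p-values up near zero (right skew); a null
# effect spreads them roughly evenly. With only a handful of published studies,
# though, either pattern can arise by chance; hence the calibrated uncertainty.
```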

Inferences about authors should be rare. We should have a low bar for talking about science and a high bar for talking about scientists. This cuts both ways. Casual talk challenging authors’ competence, intentions, unreported behaviors, etc. is often both hurtful and unjustified when we are talking about single papers.* But also, authors’ positive assertions about their character, behavior, etc. rarely shed light and can have the perverse effect of reinforcing the message that they, and not just the work, are a legitimate part of the conversation. As much as possible, make all the nouns in your discussion things like “the results,” “the procedure,” etc. and not “the authors” (or for that matter “my critics”). And whether you are an author, a critic, or even an observer, you can point out when people are talking about authors and redirect the conversation to the work.

I realize this last item draws a razor-thin line and maybe sometimes it is no line at all. After all, things like what safeguards were in place, and what happened if they weren’t, are results of the researcher’s behavior. So even valid criticism implicates what the authors did or didn’t do, and it will likely be personally uncomfortable for them. But it’s a distinction that’s worth observing as much as you can when you criticize work or respond to criticisms. And I would hope we’ve learned from the ways we talk about experimenter effects that it is possible to have less heated, and frankly more substantive, discussions about bias when we do that.

Finally, it is worth pointing out that preregistration and other safeguards are still really new to psychology and many other scientific fields. We are all still learning, collectively, how to do them well. That means that we need to be able to criticize them openly, publicly, and vigorously – if we do not talk about them, we cannot get better at doing them. But it also means that some preregistration is almost always better than none, because even a flawed or incomplete one will increase transparency and make it possible to criticize work more effectively. Even as we critique preregistrations that could have been done better, we should recognize that anybody who makes that critique and improvement possible has done something of value.


* In the bigger picture, for better or worse, science pins career advancement, resources, prestige, etc. to people’s reputations. So at some point we have to be able to talk about these things. This is a difficult topic and not something I want to get into here, other than to say that discussions about who is a good scientist are probably better left to entirely separate conversations from ones where we scientifically evaluate single papers, because the evidentiary standards and consequences are so different.

Accountable replications at Royal Society Open Science: A model for scientific publishing

Kintsugi pottery. Source: Wikimedia Commons.

Six years ago I floated an idea for scientific journals that I nicknamed the Pottery Barn Rule. The name is a reference to an apocryphal retail store policy captured in the phrase, “you break it, you bought it.” The idea is that if you pick something up in the store, you are responsible for what happens to it in your hands.* The gist of the idea in that blog post was as follows: “Once a journal has published a study, it becomes responsible for publishing direct replications of that study. Publication is subject to editorial review of technical merit but is not dependent on outcome.”

The Pottery Barn framing was somewhat lighthearted, but a more serious inspiration for it (though I don’t think I emphasized this much at the time) was newspaper correction policies. When news media take a strong stance on vetting reports of errors and correcting the ones they find, they are more credible in the long run. The good ones understand that taking a short-term hit when they mess up is part of that larger process.**

The core principle driving the Pottery Barn Rule is accountability. When a peer-reviewed journal publishes an empirical claim, its experts have judged the claim to be sound enough to tell the world about. A journal that adopts the Pottery Barn Rule is signaling that it stands behind that judgment. If a finding was important enough to publish the first time, then it is important enough to put under further scrutiny, and the journal takes responsibility to tell the world about those efforts too.

In the years since the blog post, a few journals have picked up the theme in their policies or practices. For example, the Journal of Research in Personality encourages replications and offers an abbreviated review process for studies it has published within the last 5 years. Psychological Science offers a preregistered direct replications submission track for replications of work they’ve published. And Scott Lilienfeld has announced that Clinical Psychological Science will follow the Pottery Barn Rule. In all three instances, these journals have led the field in signaling that they take responsibility for publishing replications.

And now, thanks to hard work by Chris Chambers,*** I was excited to read this morning that the journal Royal Society Open Science has announced a new replication policy that is the fullest implementation yet. Other journals should view the RSOS policy as a model for adoption. In the new policy, RSOS is committing to publishing any technically sound replication of any study it has published, regardless of the result, and providing a clear workflow for how it will handle such studies.

What makes the RSOS policy stand out? Accountability means tying your hands – you do not get to dodge it when it will sting or make you look bad. Under the RSOS policy, editors will still judge the technical faithfulness of replication studies. But they cannot avoid publishing replications on the basis of perceived importance or other subjective factors. Rather, whatever determination the journal originally made about those subjective questions at the time of the initial publication is applied to the replication. Making this a firm commitment, and having it spelled out in a transparent written policy, means that the scientific community knows where the journal stands and can easily see if the journal is sticking to its commitment. Making it a written policy (not just a practice) also means it is more likely to survive past the tenure of the current editors.

Such a policy should be a win both for the journal and for science. For RSOS – and for authors that publish there – articles will now have the additional credibility that comes from a journal saying it will stand by the decision. For science, this will contribute to a more complete and less biased scientific record.

Other journals should now follow suit. Just as readers would trust a news source more if it is transparent about corrections — and less if it leaves it to other media to fix its mistakes — readers should have more trust in journals that view replications of things they’ve published as their responsibility, rather than leaving them to other (often implicitly “lesser”) journals. Adopting the RSOS policy, or one like it, will be a way for journals to raise the credibility of the work they publish while making scientific publishing more rigorous and transparent.


* In reality, the actual Pottery Barn store will quietly write off the breakage as a loss and let you pretend it never happened. This is probably not a good model to emulate for science.

** One difference is that because newspapers report concrete facts, they work from a presumption that they got those things right, and they only publish corrections for errors. Whereas in science, uncertainty looms much larger in our epistemology. We draw conclusions from the accumulation of statistical evidence, so the results of all verification attempts have value regardless of outcome. But the common theme across both domains is being accountable for the things you have reported.

*** You may remember Chris from such films as registered reports and stop telling the world you can cure cancer because of seven mice.

The replication price-drop in social psychology

Why is the replication crisis centered on social psychology? In a recent post, Andrew Gelman offered a list of possible reasons. Although I don’t agree with every one of his answers (I don’t think data-sharing is common in social psych for example), it is an interesting list of ideas and an interesting question.

I want to riff on one of those answers, because it is something I’ve been musing about for a while. Gelman suggests that in social psychology, experiments are comparatively easy and cheap to replicate. Let’s stipulate that this is true of at least some parts of social psych. (Not necessarily all of them – I’ll come back to that.) What would easy and cheap replications do for a field? I’d suggest they have two, somewhat opposing effects.

On the one hand, running replications is the most straightforward way to obtain evidence about whether an effect is replicable.1 So the easier it is to run a replication, the easier it will be to discover if a result is a fluke. Broaden that out, and if a field has lots of replicability problems and replications are generally easy to run, it should be easier to diagnose the field.

But on the other hand, easy and cheap replications should also facilitate a scientific ecosystem where unreplicable work gets weeded out. So over time, you might expect a field to settle into an equilibrium where, by routinely running those easy and cheap replications, it keeps unreplicable work at a comfortably low rate.2 No underlying replication problem, therefore no replication crisis.

The idea that I have been musing on for a while is that “replications are easy and cheap” is a relatively new development in social psychology, and I think that may have some interesting implications. I tweeted about it a while back but I thought I’d flesh it out.

Consider that until around a decade ago, almost all social psychology studies were run in person. You might be able to do a self-report questionnaire study in mass testing sessions, but a lot of experimental protocols could only be run a few subjects at a time. For example, any protocol that involved interaction or conditional logic (i.e., couldn’t just be printed on paper for subjects to read) required live RAs to run the show. A fair amount of measurement was analog and required data entry. And computerized presentation or assessment was rate-limited by the number of computers and cubicles a lab owned. All of this meant that even a lot of relatively simpler experiments required a nontrivial investment of labor and maybe money. And a lot of those costs were per-subject costs, so they did not scale up well.

All of this changed only fairly recently, with the explosion of internet experimentation. In the early days of the dotcom revolution you had to code websites yourself,3 but eventually companies like Qualtrics sprang up with pretty sophisticated and usable software for running interactive experiments. That meant that subjects could complete many kinds of experiments at home without any RA labor to run the study. And even for in-lab studies, a lot of data entry – which had been a labor-intensive part of running even a simple self-report study – was cut out. (Or if you were already using experiment software, you went from having to buy a site license for every subject-running computer to being able to run it on any device with a browser, even a tablet or phone.) And Mechanical Turk meant that you could recruit cheap subjects online in large numbers and they would be available virtually instantly.

Taken all together, what this means is that for some kinds of experiments in some areas of psychology, replications have undergone a relatively recent and sizeable price drop. Some kinds of protocols pretty quickly went from something that might need a semester and a team of RAs to something you could set up and run in an afternoon.4 And since you weren’t booking up your finite lab space or spending a limited subject-pool allocation, the opportunity costs got lower too.

Notably, growth of all of the technologies that facilitated the price-drop accelerated right around the same time as the replication crisis was taking off. Bem, Stapel, and false-positive psychology were all in 2011. That’s the same year that Buhrmester et al published their guide to running experiments on Mechanical Turk, and just a year later Qualtrics got a big venture capital infusion and started expanding rapidly.

So my conjecture is that the sudden price drop helped shift social psychology out of a replications-are-rare equilibrium and moved it toward a new one. In pretty short order, experiments that previously would have been costly to replicate (in time, labor, money, or opportunity) got a lot cheaper. This meant that there was a gap between the two effects of cheap replications I described earlier: All of a sudden it was easy to detect flukes, but there was a buildup of unreplicable effects in the literature from the old equilibrium. This might explain why a lot of replications in the early twenty-teens were social priming5 studies and similar paradigms that lend themselves to online experimentation pretty well.

To be sure, I don’t think this could by any means be a complete theory. It’s more of a facilitating change along with other factors. Even if replications are easy and cheap, researchers still need to be motivated to go and run them. Social psychology had a pretty strong impetus to do that in 2011, with Bem, Stapel, and False-positive psychology all breaking in short order. And as researchers in social psychology started finding cause for concern in those newly-cheap studies, they were motivated to widen their scope to replicating other studies that had been designed, analyzed, and reported in similar ways but that hadn’t had so much of a price-drop.

To date that broadening-out from the easy and cheap studies hasn’t spread nearly as much to other subfields like clinical psychology or developmental psychology. Perhaps there is a bit of an ingroup/outgroup dynamic – it is easier to say “that’s their problem over there” than to recognize commonalities. And those fields don’t have a bunch of cheap-but-influential studies of their own to get them started internally.6

An optimistic spin on all this is that social psychology could be on its way to a new equilibrium where running replications becomes more of a normal thing. But there will need to be an accompanying culture shift where researchers get used to seeing replications as part of mainstream scientific work.

Another implication is that the price-drop and resulting shift in equilibrium has created a kind of natural experiment where the weeding-out process has lagged behind the field’s ability to run cheap replications. A boom in metascience research has taken advantage of this lag to generate insights into what does7 and doesn’t8 make published findings less likely to be replicated. Rather than saying “oh that’s those people over there,” fields and areas where experiments are difficult and expensive could and should be saying, wow, we could have a problem and not even know it – but we can learn some lessons from seeing how “those people over there” discovered they had a problem and what they learned about it.


  1. Hi, my name is Captain Obvious. 
  2. Conversely, it is possible that a field where replications are hard and expensive might reach an equilibrium where unreplicable findings could sit around uncorrected. 
  3. RIP the relevance of my perl skills. 
  4. Or let’s say a week + an afternoon if you factor in getting your IRB exemption. 
  5. Yeah, I said social priming without scare quotes. Come at me. 
  6. Though admirably, some researchers in those fields are now trying anyway, costs be damned. 
  7. Selective reporting of underpowered results
  8. Hidden moderators

Reflections on SIPS (guest post by Neil Lewis, Jr.)

The following is a guest post by Neil Lewis, Jr. Neil is an assistant professor at Cornell University.

Last week I visited the Center for Open Science in Charlottesville, Virginia to participate in the second annual meeting of the Society for the Improvement of Psychological Science (SIPS). It was my first time going to SIPS, and I didn’t really know what to expect. The structure was unlike any other conference I’ve been to—it had very little formal structure—there were a few talks and workshops here and there, but the vast majority of the time was devoted to “hackathons” and “unconference” sessions where people got together and worked on addressing pressing issues in the field: making journals more transparent, designing syllabi for research methods courses, forming a new journal, changing departmental/university culture to reward open science practices, making open science more diverse and inclusive, and much more. Participants were free to work on whatever issues we wanted to and to set our own goals, timelines, and strategies for achieving those goals.

I spent most of the first two days at the diversity and inclusion hackathon that Sanjay and I co-organized. These sessions blew me away. Maybe we’re a little cynical, but going into the conference we thought maybe two or three people would stop by and thus it would essentially be the two of us trying to figure out what to do to make open science more diverse and inclusive. Instead, we had almost 40 people come and spend the first day identifying barriers to diversity and inclusion, and developing tools to address those barriers. We had sub-teams working on (1) improving measurement of diversity statistics (hard to know how much of a diversity problem one has if there’s poor measurement), (2) figuring out methods to assist those who study hard-to-reach populations, (3) articulating the benefits of open science and resources to get started for those who are new, (4) leveraging social media for mentorship on open science practices, and (5) developing materials to help PIs and institutions more broadly recruit and retain traditionally underrepresented students/scholars. Although we’re not finished, each team made substantial headway in each of these areas.

On the second day, those teams continued working, but in addition we had a “re-hack” that allowed teams that were working on other topics (e.g., developing research methods syllabi, developing guidelines for reviewers, starting a new academic journal) to present their ideas and get feedback on how to make their projects/products more inclusive from the very beginning (rather than having diversity and inclusion be an afterthought as is often the case). Once again, it was inspiring to see how committed people were to making sure so many dimensions of our science become more inclusive.

These sessions, and so many others at the conference, gave me a lot of hope for the field—hope that I (and I suspect others) could really use (special shout-outs to Jessica Flake’s unconference on improving measurement, Daniel Lakens and Jeremy Biesanz’s workshop on sample size and effect size, and Liz Page-Gould and Alex Danvers’s workshop on Fundamentals of R for data analysis). It’s been a tough few years to be a scientist. I was working on my PhD in social psychology at the time that the Open Science Collaboration published its report estimating the reproducibility of psychological science to be somewhere between one-third and one-half. Then a similar report came out about the state of cancer research – only twenty-five percent of papers replicated there. Now it seems like at least once a month there is some new failed replication study or some other study comes out that has major methodological flaw(s). As someone just starting out, constantly seeing findings I learned were fundamental fail to replicate, and new work emerge so flawed, I often find myself wondering (a) what the hell do we actually know, and (b) if so many others can’t get it right, what chance do I have?

Many Big Challenges with No Easy Solutions

To try and minimize future fuck-ups in my own work, I started following a lot of methodologists on Twitter so that I could stay in the loop on what I need to do to get things right (or at least not horribly wrong). There are a lot of proposed solutions out there (and some argument about those solutions, e.g., p < .005) but there are some big ones that seem to have reached consensus, including vastly increasing the size of our samples to increase the reliability of findings. These solutions make sense for addressing the issues that got us to this point, but the more I’ve thought about and talked to others about them, the more it became clear that some may unintentionally create another problem along the way, which is to “crowd out” some research questions and researchers. For example, when talking with scholars who study hard-to-reach populations (e.g., racial and sexual minorities), a frequently voiced concern is that it is nearly impossible to recruit the sample sizes needed to meet new thresholds of evidence.

To provide an example from my own research, I went to graduate school intending to study racial-ethnic disparities in academic outcomes (particularly Black-White achievement gaps). In my first semester at the University of Michigan I asked my advisor to pay for a pre-screen of the department of psychology’s participant pool to see how many Black students I would have to work with if I pursued that line of research. There were 42 Black students in the pool that semester. Forty-two. Out of 1,157. If memory serves me well, that was actually one of the highest concentrations of Black students in the pool in my entire time there. Seeing that, I asked others who study racial minorities what they did. I learned that unless they had well-funded advisors that could afford to pay for their samples, many either shifted their research questions to topics that were more feasible to study, or they would spend their graduate careers collecting data for one or two studies. In my area, that latter approach was not practical for being employable—professional development courses taught us that search committees expect multiple publications in the flagship journals, and those flagship journals usually require multiple studies for publication.

Learning about those dynamics, I temporarily shifted my research away from racial disparities until I figured out how to feasibly study those topics. In the interim, I studied other topics where I could recruit enough people to do the multi-study papers that were expected. That is not to say I am uninterested in those other topics I studied (I very much am) but disparities were what interested me most. Now, some may read that and think ‘Neil, that’s so careerist of you! You should have pursued the questions you were most passionate about, regardless of how long it took!’ And on an idealistic level, I agree with those people. But on a practical level—I have to keep a roof over my head and eat. There was no safety net at home if I was unable to get a job at the end of the program. So I played it safe for a few years before going back to the central questions that brought me to academia in the first place.

That was my solution. Others left altogether. As one friend depressingly put it—“there’s no more room for people like us; unless we get lucky with the big grants that are harder and harder to get, we can’t ask our questions—not when power analyses now say we need hundreds per cell; we’ve been priced out of the market.” And they’re not entirely wrong. Some collaborators and I recently ran a survey experiment with Black American participants; it was a 20-minute survey with 500 Black Americans. That one study cost us $11,000. Oh, and it’s a study for a paper that requires multiple studies. The only reason we can do this project is because we have a senior faculty collaborator who has an endowed chair and hence deep research pockets.
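To make that arithmetic concrete, here is a rough back-of-the-envelope sketch. The effect sizes are just conventional benchmarks, and the per-participant cost is an assumption loosely based on the study I just described (about $22 per person); none of these numbers come from an actual budget or power analysis for a specific project:

```python
import math
from statsmodels.stats.power import TTestIndPower

COST_PER_PARTICIPANT = 22   # assumed: roughly $11,000 / 500 from the study above
power_analysis = TTestIndPower()

for d in (0.2, 0.3, 0.5):   # small-to-medium standardized effect sizes
    n_per_cell = math.ceil(power_analysis.solve_power(effect_size=d, alpha=.05, power=.80))
    total_cost = 2 * n_per_cell * COST_PER_PARTICIPANT   # simple two-cell design
    print(f"d = {d}: about {n_per_cell} per cell, roughly ${total_cost:,} for one study")
# For a small effect (d = 0.2) that is ~394 per cell and north of $17,000 for a
# single two-cell study; a multi-study paper multiplies that several times over.
```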

So that is the state of affairs. The goal post keeps shifting, and it seems that those of us who already had difficulty asking our questions have to choose between pursuing the questions we’re interested in, and pursuing questions that are practical for keeping roofs over our heads (e.g., questions that can be answered for $0.50 per participant on MTurk). And for a long time this has been discouraging because it felt as though those who have been leading the charge on research reform did not care. An example that reinforces this sentiment is a quote that floated around Twitter just last week. A researcher giving a talk at a conference said “if you’re running experiments with low sample n, you’re wasting your time. Not enough money? That’s not my problem.”

That researcher is not wrong. For all the reasons methodologists have been writing about for the past few years (and really, past few decades), issues like small sample sizes do compromise the integrity of our findings. At the same time, I can’t help but wonder about what we lose when the discussion stops there, at “that’s not my problem.” He’s right—it’s not his personal problem. But it is our collective problem, I think. What questions are we missing out on when we squeeze out those who do not have the thousands or millions of dollars it takes to study some of these topics? That’s a question that sometimes keeps me up at night, particularly the nights after conversations with colleagues who have incredibly important questions that they’ll never pursue because of the constraints I just described.

A Chance to Make Things Better

Part of what was so encouraging about SIPS was that we not only began discussing these issues, but people immediately took them seriously and started working on strategies to address them—putting together resources on “small-n designs” for those who can’t recruit the big samples, to name just one example. I have never seen issues of diversity and inclusion taken so seriously anywhere, and I’ve been involved in quite a few diversity and inclusion initiatives (given the short length of my career). At SIPS, people were working tirelessly to make actionable progress on these issues. And again, it wasn’t a fringe group of women and minority scholars doing this work as is so often the case—we had one of the largest hackathons at the conference. I really wish more people were there to witness it—it was amazing, and energizing. It was the best of science—a group of committed individuals working incredibly hard to understand and address some of the most difficult questions that are still unanswered, and producing practical solutions to pressing social issues.

Now it is worth noting that I had some skepticism going into the conference. When I first learned about it I went back-and-forth on whether I should go; and even the week before the conference, I debated canceling the trip. I debated canceling because there was yet another episode of the “purely hypothetical scenario” that Will Gervais described in his recent blog post:

A purely hypothetical scenario, never happens [weekly+++]

Some of the characters from that scenario were people I knew would be attending the conference. I was so disgusted watching it unfold that I had no desire to interact with them the following week at the conference. My thought as I watched the discourse was: if it is just going to be a conference of the angry men from Twitter where people are patted on the back for their snark, using a structure from the tech industry—an industry not known for inclusion, then why bother attend? Apparently, I wasn’t alone in that thinking. At the diversity hackathon we discussed how several of us invited colleagues to come who declined because, due to their perceptions of who was going to be there and how those people often engage on social media, they did not feel it was worth their time.

I went despite my hesitation and am glad I did—it was the best conference I’ve ever attended. The attendees were not only warm and welcoming in real life, they also seemed to genuinely care about working together to improve our science, and to improve it in equitable and inclusive ways. They really wanted to hear what the issues are, and to work together to solve them.

If we regularly engage with each other (both online and face-to-face) in the ways that participants did at SIPS 2017, the sky is the limit for what we can accomplish together. The climate in that space for those few days provided the optimal conditions for scientific progress to occur. People were able to let their guards down, to acknowledge that what we’re trying to do is f*cking hard and that none of us know all the answers, to admit and embrace that we will probably mess up along the way, and that’s ok. As long as we know more and are doing better today than we knew and did yesterday, we’re doing ok – we just have to keep pushing forward.

That approach is something that I hope those who attended can take away, and figure out how to replicate in other contexts, across different mediums of communication (particularly online). I think it’s the best way to do, and to improve, our science.

I want to thank the organizers for all of the work they put into the conference. You have no idea how much being in that setting meant to me. I look forward to continuing to work together to improve our science, and hope others will join in this endeavor.

Improving Psychological Science at SIPS

Last week was the second meeting of the Society for the Improvement of Psychological Science, a.k.a. SIPS[1]. SIPS is a service organization with the mission of advancing and supporting all of psychological science. About 200 people met in Charlottesville, VA to participate in hackathons and lightning talks and unconference sessions, go to workshops, and meet other people interested in working to improve psychology.

What Is This Thing Called SIPS?

If you missed SIPS and are wondering what happened – or even if you were there but want to know more about the things you missed – here are a few resources I have found helpful:

The conference program gives you an overview and the conference OSF page has links to most of what went on, though it’s admittedly a lot to dig through. For an easier starting point, Richie Lennie posted an email he wrote to his department with highlights and links, written specifically with non-attendees in mind.

Drilling down one level from the conference OSF page, all of the workshop presenters put their materials online. I didn’t make it to any workshops so I appreciate having access to those resources. One good example is Simine Vazire and Bobbie Spellman’s workshop on writing transparent and reproducible articles. Their slides include excerpts from published papers on things like how to transparently report exploratory analyses, how to report messy results, how to interpret a null result, and more. For me, writing is a lot easier when I have examples and models to work from, and I expect that I will be referring to those in the future.

The list of hackathon OSF pages is worth browsing. Hackathons are collaborative sessions for people interested in working on a defined project. Organizers varied in how much they used OSF – some used them mainly for internal organization, while others hosted finished or near-finished products on them. A standout example of the latter category is from the graduate research methods course hackathon. Their OSF wiki has a list of 31 topics, almost all of which are live links to pages with learning goals, reading lists, demonstrations, and assignments. If you teach grad research methods, or anything else with methodsy content, go raid the site for all sorts of useful materials.

The program also had space for smaller or less formal events. Unconferences were spontaneously organized sessions, some of which grew into bigger projects. Lightning talks were short presentations, often about work in progress.

As you browse through the resources, it is also worth keeping in the back of your mind that many projects get started at SIPS but not finished there, so look for more projects to come to fruition in the weeks and months ahead.

A challenge for future SIPS meetings is going to be figuring out how to reach beyond the people physically attending the meeting and get the broadest possible engagement, as well as to support dissemination of projects and initiatives that people create at SIPS. We have already gotten some valuable feedback about how other hackathons and unconferences manage that. This year’s meeting happened because of a Herculean effort by a very small group of volunteers[2] operating on a thin budget (at one point it was up in the air whether there’d even be wifi in the meeting space, if you can believe it) who had to plan an event that doubled in size from last year. As we grow we will always look for more and better ways to engage – the I in SIPS would not count for anything if the society did not apply it to itself.

My Personal Highlights

It is hard to summarize but I will mention a few highlights from things that I saw or participated in firsthand.

Neil Lewis Jr. and I co-organized a hackathon on diversity and inclusion in open science. We had so many people show up that we eventually split into five smaller groups working on different projects. My group worked on helping SIPS-the-organization start to collect member data so it can track how it is doing with respect to its diversity and inclusion goals. I posted a summary on the OSF page and would love to get feedback. (Neil is working on a guest post, so look for more here about that hackathon in the near future.)

Another session I participated in was the “diversity re-hack” on day two. The idea was that diversity and inclusion are relevant to everything, not just what comes up at a hackathon with “diversity and inclusion” in the title. So people who had worked on all the other hackathons on day one could come and workshop their in-progress projects to make them serve those goals even better. It was another well-attended session and we had representatives from nearly every hackathon group come to participate.

Katie Corker was the first recipient of the society’s first award, the SIPS Leadership Award. Katie has been instrumental in the creation of the society and in organizing the conference, and beyond SIPS she has also been a leader in open science in the academic community. Katie is a dynamo and deserves every bit of recognition she gets.

It was also exciting to see projects that originated at the 2016 SIPS meeting continuing to grow. During the meeting, APA announced that it will designate PsyArXiv as its preferred preprint server. And the creators of StudySwap, which also came out of SIPS 2016, just announced an upcoming Nexus (a fancy term for what we called “special issue” in the print days) with the journal Collabra: Psychology on crowdsourced research.

Speaking of which, Collabra: Psychology is now the official society journal of SIPS. It is fitting that SIPS partnered with an open-access journal, given the society’s mission. SIPS will oversee editorial responsibilities and the scientific mission of the journal, while the University of California Press will operate as the publisher.

But probably the most gratifying thing for me about SIPS was meeting early-career researchers who are excited about making psychological science more open and transparent, more rigorous and self-correcting, and more accessible and inclusive of everyone who wants to do science or could benefit from science. The challenges can sometimes feel huge, and I found it inspiring and energizing to spend time with people just starting out in the field who are dedicated to facing them.

*****

1. Or maybe it was the first meeting, since we ended last year’s meeting with a vote on whether to become a society, even though we were already calling ourselves that? I don’t know, bootstrapping is weird.

2. Not including me. I am on the SIPS Executive Committee so I got to see up close the absurd amount of work that went into making the conference. Credit for the actual heavy lifting goes to Katie Corker and Jack Arnal, the conference planning committee who made everything happen with the meeting space, hotel, meals, and all the other logistics; and the program committee of Brian Nosek, Michèle Nuijten, John Sakaluk, and Alexa Tullett, who were responsible for putting together the scientific (and, uh, I guess meta-scientific?) content of the conference.

Learning exactly the wrong lesson

For several years now I have heard fellow scientists worry that the dialogue around open and reproducible science could be used against science – to discredit results that people find inconvenient and even to de-fund science. And this has not just been fretting around the periphery. I have heard these concerns raised by scientists who hold policymaking positions in societies and journals.

A recent article by Ed Yong talks about this concern in the present political climate.

In this environment, many are concerned that attempts to improve science could be judo-flipped into ways of decrying or defunding it. “It’s been on our minds since the first week of November,” says Stuart Buck, Vice President of Research Integrity at the Laura and John Arnold Foundation, which funds attempts to improve reproducibility.

The worry is that policy-makers might ask why so much money should be poured into science if so many studies are weak or wrong? Or why should studies be allowed into the policy-making process if they’re inaccessible to public scrutiny? At a recent conference on reproducibility run by the National Academies of Sciences, clinical epidemiologist Hilda Bastian says that she and other speakers were told to consider these dangers when preparing their talks.

One possible conclusion is that this means we should slow down science’s movement toward greater openness and reproducibility. As Yong writes, “Everyone I spoke to felt that this is the wrong approach.” But as I said, those voices are out there and many could take Yong’s article as reinforcing their position. So I think it bears elaboration why that would be the wrong approach.

Probably the least principled reason, but an entirely unavoidable practical one, is just that it would be impossible. The discussion cannot be contained. Notwithstanding some defenses of gatekeeping and critiques of science discourse on social media (where much of this discussion is happening), there is just no way to keep scientists from talking about these issues in the open.

And imagine for a moment that we nevertheless tried to contain the conversation. Would that be a good idea? Consider the “climategate” faux-scandal. Opponents of climate science cooked up an anti-transparency conspiracy out of a few emails that showed nothing of the sort. Now imagine if we actually did that – if we kept scientists from discussing science’s problems in the open. And imagine that getting out. That would be a PR disaster to dwarf any misinterpretation of open science (because the worst PR disasters are the ones based in reality).

But to me, the even more compelling consideration is that if we put science’s public image first, we are inverting our core values. The conversation around open and reproducible science cuts to fundamental questions about what science is – such as that scientific knowledge is verifiable, and that it belongs to everyone – and why science offers unique value to society. We should fully and fearlessly engage in those questions and in making our institutions and practices better. We can solve the PR problem after that. In the long run, the way to make the best possible case for science is to make science the best possible.

Rather than shying away from talking about openness and reproducibility, I believe it is more critical than ever that we all pull together to move science forward. Because if we don’t, others will make changes in our name that serve other agendas.

For example, Yong’s article describes a bill pending in Congress that would set impossibly high standards of evidence for the Environmental Protection Agency to base policy on. Those standards are wrapped in the rhetoric of open science. But as Michael Eisen says in the article, “It won’t produce regulations based on more open science. It’ll just produce fewer regulations.” This is almost certainly the intended effect.

As long as scientists – individually and collectively in our societies and journals – drag our heels on making needed reforms, there will be a vacuum that others will try to fill. Turn that around, and the better the scientific community does its job of addressing openness and transparency in the service of actually making science do what science is supposed to do – making it more open, more verifiable, more accessible to everyone – the better positioned we will be to rebut those kinds of efforts by saying, “Nope, we got this.”

False-positive psychology five years later

Joe Simmons, Leif Nelson, and Uri Simonsohn have written a 5-years-later[1] retrospective on their “false-positive psychology” paper. It is for an upcoming issue of Perspectives on Psychological Science dedicated to the most-cited articles from APS publications. A preprint is now available.

It’s a short and snappy read with some surprises and gems. For example, footnote 2 notes that the Journal of Consumer Research declined to adopt their disclosure recommendations because they might “dull … some of the joy scholars may find in their craft.” No, really.

For the youngsters out there, they do a good job of capturing in a sentence a common view of what we now call p-hacking: “Everyone knew it was wrong, but they thought it was wrong the way it’s wrong to jaywalk. We decided to write ‘False-Positive Psychology’ when simulations revealed it was wrong the way it’s wrong to rob a bank.”[2]

The retrospective also contains a review of how the paper has been cited in 3 top psychology journals. About half of the citations are from researchers following the original paper’s recommendations, but typically only a subset of them. The most common citation practice is to justify having barely more than 20 subjects per cell, which they now describe as a “comically low threshold” and take a more nuanced view on.
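For anyone who has not run the numbers, here is a quick sketch of what that threshold buys you; the effect sizes are just Cohen’s conventional benchmarks, not anything from the original paper:

```python
from statsmodels.stats.power import TTestIndPower

power_analysis = TTestIndPower()
for d in (0.2, 0.5, 0.8):   # Cohen's conventional small, medium, and large effects
    power = power_analysis.solve_power(effect_size=d, nobs1=20, alpha=.05)
    print(f"n = 20 per cell, d = {d}: power is about {power:.2f}")
# Roughly .09, .34, and .69: even a genuinely medium-sized effect is more likely
# to be missed than detected with 20 subjects per cell.
```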

But to me, the most noteworthy passage was this one because it speaks to institutional pushback on the most straightforward of their recommendations:

Our paper has had some impact. Many psychologists have read it, and it is required reading in at least a few methods courses. And a few journals – most notably, Psychological Science and Social Psychological and Personality Science – have implemented disclosure requirements of the sort that we proposed (Eich, 2014; Vazire, 2015). At the same time, it is worth pointing out that none of the top American Psychological Association journals have implemented disclosure requirements, and that some powerful psychologists (and journal interests) remain hostile to costless, common sense proposals to improve the integrity of our field.

Certainly there are some small refinements you could make to some of the original paper’s disclosure recommendations. For example, Psychological Science requires you to disclose all variables “that were analyzed for this article’s target research question,” not all variables period. Which is probably an okay accommodation for big multivariate studies with lots of measures.[3]

But it is odd to be broadly opposed to disclosing information in scientific publications that other scientists would consider relevant to evaluating the conclusions. And yet I have heard these kinds of objections raised many times. What is lost by saying that researchers have to report all the experimental conditions they ran, or whether data points were excluded and why? Yet here we are in 2017 and you can still get around doing that.

 


1. Well, five-ish. The paper came out in late 2011.

2. Though I did not have the sense at the time that everyone knew about everything. Rather, knowledge varied: a given person might think that fiddling with covariates was like jaywalking (technically wrong but mostly harmless), that undisclosed dropping of experimental conditions was a serious violation, but be completely oblivious to the perils of optional stopping. And a different person might have had a different constellation of views on the same 3 issues.

3. A counterpoint is that if you make your materials open, then without clogging up the article proper, you allow interested readers to go and see for themselves.