Will this time be different?

I had the honor to deliver the closing address to the Society for the Improvement of Psychological Science on July 9, 2019 in Rotterdam. The following are my prepared remarks. (These remarks are also archived on PsyArXiv.)

Some years ago, not long after people in psychology began talking in earnest about a replication crisis and what to do about it, I was talking with a colleague who has been around the field for longer than I have. He said to me, “Oh, this is all just cyclical. Psychology goes through a bout of self-flagellation every decade or two. It’ll be over soon and nothing will be different.”

I can’t say I blame him. Psychology has had other periods of reform that have fizzled out. One of the more recent ones was the statistical reform effort of the late 20th century – and you should read Fiona Fidler’s history of it, because it is completely fascinating. Luminaries like Jacob Cohen, Paul Meehl, Robert Rosenthal, and others were members and advisors of a blue-ribbon APA task force to change the practice of statistics in psychology. This resulted in the APA style manual adding a requirement to report effect sizes – one which is occasionally even followed, though the accompanying call to interpret effect sizes has gotten much less traction – and a few other modest improvements. But it was nothing like the sea change that many of them believed was needed.

But flash forward to the current decade. When people ask themselves, “Will this time be different?” it is fair to say there is a widespread feeling that indeed it could be. There is no single reason. Instead, as Bobbie Spellman and others have written, it is a confluence of contributing factors.

One of them is technology. The flow of scientific information is no longer limited by what we can print on sheets of pulped-up tree guts bound together into heavy volumes and sent by truck, ship, and airplane around the world. Networked computing and storage means that we can share data, materials, code, preprints, and more, at a quantity and speed that was barely imagined even a couple of decades ago when I was starting graduate school. Technology has given scientists far better ways to understand and verify the work we are building on, collaborate, and share what we have discovered.

A second difference is that more people now view the problem not just as an analytic one – the domain of logicians and statisticians – but also, complementarily, as a human one. So, for example, the statistical understanding of p-values as a function of a model and data has been married to a social-scientific understanding: p-values are also a function of the incentives and institutions that the people calculating them are working under. We see meta-scientists collecting data and developing new theories of how scientific knowledge is produced. More people see the goal of studying scientific practice not just as diagnosis – identify a problem, write a paper about it, and go to your grave knowing you were right about something – but also as designing effective interventions and embracing the challenges of scaling up to implementation.

A third difference, and perhaps the most profound, is where the ideas and energy are coming from. A popular debate on Twitter is what to call this moment in our field’s history. Is it a crisis? A renaissance? A revolution? One term that gets used a lot is “open science movement.” Once, when this came up on Twitter, I asked a friend who’s a sociologist what he thought. He stared at me for a good three seconds, like I’d just grown a second head, and said: “OF COURSE it’s a social movement.” (It turns out that people debating “are we a social movement?” is classic social movement behavior.) I think that idea has an underappreciated depth to it. Because maybe the biggest difference is that what we are seeing now is truly a grassroots social movement.

What does it mean to take seriously the idea that open science is a social movement? Unlike blue-ribbon task forces, social movements do not usually have a single agenda or a formal charge. They certainly aren’t made up of elites handpicked by august institutions. Instead, movements are coalitions – of individuals, groups, communities, and organizations that have aligned, but often not identical, values, priorities, and ideas.

We see that pluralism in the open science movement. To take just one example, many in psychology see a close connection between openness and rigor. We trace problems with replicability and cumulative scientific progress back, in part, to problems with transparency. When we cannot see details of the methods used to produce important findings, when we cannot see what the data actually look like, when we cannot verify when in the research process key decisions were made, then we cannot properly evaluate claims and evidence. But another very different argument for openness is about access and justice: expanding who gets to see and benefit from the work of scientists, join the discourse around it, and do scientific work. Issues of access would be important no matter how replicable and internally rigorous our science was. Of course, many – and I count myself among them – embrace both of these as animating concerns, even if we came to them from different starting points. That’s one of the powerful things that can happen when movements bring together people with different ideas and different experiences. But as the movement grows and matures, the differences will increase too. Different concerns and priorities will not always be so easily aligned. We need to be ready for that.

SIPS is not the open science movement – the movement is much bigger than we are. Nobody has to be a part of SIPS to do open science or be part of the movement. We should never make the mistake of believing that a SIPS membership defines open science, as my predecessor Katie Corker told us so eloquently last year. But we have the potential to be a powerful force for good within the movement. When SIPS had its first meeting just three years ago, it felt like a small, ragtag band of outsiders who had just discovered they weren’t alone. Now look at us. We have grown in size so fast that our conference organizers could barely keep up. 525 people flew from around the world to get together and do service. Signing up for service! (Don’t tell your department chair.) People are doing it because they believe in our mission and want to do something about it.

This brings me to what I see as the biggest challenge that lies ahead for SIPS. As we have grown and will continue to grow, we need to be asking: What do we do about differences? Both the differences that already exist in our organization, and the differences that could be represented here but aren’t yet. Differences in backgrounds and identities, differences in culture and geography, differences in subfields and interests and approaches. To which my answer is: Differences can be our strength. But that won’t happen automatically. It will take deliberation, intent, and work to make them an asset.

What does that mean? Within the open science movement, many have been working on improvements. But there is a natural tendency for people to craft narrow solutions that just work for themselves, and for people and situations they know. SIPS is at its best when it breaks through that, when it brings together people with different knowledge and concerns to work together. When a discussion about getting larger and more diverse samples includes people who come from different kinds of institutions who have access to different resources, different organizational and technical skills, but see common cause, we get the Psychological Science Accelerator. When people who work with secondary data are in the room talking about preregistration, then instead of another template for a simple two-by-two, we get an AMPPS paper about preregistration for existing data. When mixed-methods researchers feel welcomed one year, they come back the next year with friends and organize a whole session on open qualitative research.

Moving forward, for SIPS to continue to be a force for good, we have to take the same expectations we have of our science and apply them to our movement, our organization, and ourselves. We have to listen to criticism from both within and outside of the society and ask what we can learn from it. Each one of us has to take diversity and inclusion as our own responsibility and ask ourselves, how can I make this not some nice add-on, but integral to the way I am trying to improve psychology? We have to view self-correction and improvement – including improvement in how diverse and inclusive we are – as an ongoing task, not a project we will finish and move on from.

I say this not just as some nice paean to diversity, but as an existential task for SIPS and the open science movement. This is core to our values. If we remake psychological science into something that works smashingly well for the people in this room, but not for anyone else, we will have failed at our mission. The history of collective human endeavors, including social movements – the ways they can reproduce sexism and racism and other forms of inequality, and succumb to power and prestige and faction – gives us every reason to be on our guard. But the energy, passion, and ideals I’ve seen expressed these last few days by the people in this room give me cause for hope. We are, at the end of the day, a service organization. Hundreds of people turned up in Rotterdam to try to make psychology better not just for themselves, but for the world.

So when people ask, “Will this time be different?” my answer is this: Don’t ever feel certain that the answer is yes, and maybe this time it will be.

This is your Brain on Psychology – This is your Psychology on Brain (a guest post by Rob Chavez)

The following is a guest post by Rob Chavez.

If I’m ever asked ‘what was a defining moment of your career?’, I can think of a very specific instance that has stuck with me since my early days as a student in social neuroscience. I was at a journal club meeting where we were discussing an early paper using fMRI to investigate facial processing when looking at members of different racial groups. In this paper, the researchers found greater activity in the amygdala for viewing black faces than for white faces. Although the authors were careful not to say it explicitly, the implication for most readers was clear: The ‘threat center’ turned on for black faces more than white faces, therefore the participants may have implicit fear of black faces. Several students in the group brought up criticisms of that interpretation revolving around how the amygdala is involved in other processes, and we started throwing around ideas for study designs to possibly tease apart alternative explanations (e.g. lower-level visual properties, ambiguity, familiarity) that might also account for the amygdala activity.

Then it happened: The professor in the room finally chimed in. “Look, these are interesting ideas, but they don’t really tell us anything about racial bias. I don’t really care about what the amygdala does; I just care what it can tell us about social psychology.” Even in the nascent stages of my career, I was somewhat flabbergasted. Who wouldn’t want to know everything they possibly could about the thing they are using to draw inferences, especially when that thing is itself part of the mechanism? For me, this event marked a turning point where I began to think of neuroscience less as a method for answering psychological questions and started thinking of it as a multidisciplinary endeavor to which psychology has much to contribute.

Now as a card-carrying social neuroscientist, when I attend conferences, such as the Society for Personality and Social Psychology meeting, I am frequently asked what neuroscience can contribute to our understanding of psychology that we don’t already know from behavioral studies, which are often more flexible, less noisy, and much, much cheaper to run. However, contributing to psychological theory or outperforming behavioral predictions are often not the proximal goals for researchers using neuroimaging methods. Instead, much of the interest in social neuroscience stems from the potential of applying insights from psychology (and other disciplines) to better understand how cognitive and social phenomena are represented in the function and structure of the brain for its own sake, and not simply using the brain as a tool or methodology. I believe that these efforts help us refine the link between these levels of analysis and, frankly, are interesting in their own right.

To be fair to the professor at the journal club, there may be a reason that people hold the view that the brain can simply be used as a method or a tool. Many early researchers using neuroimaging were not originally trained in neuroscience per se but instead transitioned over to it from using other kinds of psychophysiological methods. As such, there are understandable reasons why many researchers doing psychophysiological work don’t have much of a motivation to care as deeply about the underlying physiological process itself. For example, if a researcher is doing a study measuring electrodermal activity, chances are that they don’t care very deeply about sweat (and possibly don’t really care about sympathetic nervous system activity), but rather use it as an indicator of emotional arousal. Put differently, nobody assumes that sweat is the origin of arousal or believes that the fingertip is the organ responsible for the seat of the mind.*

This is not true for the brain, and things start to get even more complicated quickly. Even if you want to just use fMRI amygdala activation to be a marker of threat or fear, the path to do so is not as clear as in other physiological measures. (The amygdala is not even a single structure but rather a collection of functionally distinct nuclei, each with its own functional tuning and connectivity profile to other parts of the brain.) I believe that the way many have been taught to think about measures of brain function and structure has been conflated with some of these more peripheral measures in other parts of the body that are obviously not ‘the source’ of the mind. As such, it doesn’t make much sense to ask how psychological processes are represented in skin conductance in the same way as asking how they are represented in the brain (even if using relatively crude and indirect tools like fMRI). Thus, the common criticism of some neuroimaging work that “we already know that the mind happens in the brain” is shortsighted. Yes, but how, when, and at what level of granularity? However, this perspective is not without its challenges.

One of the ways in which brain imaging can be useful to psychologists is to know when activation of a particular brain region or network is indicative of a specific psychological phenomenon. However, in the context of neuroimaging, the term reverse inference is a bit of a dirty phrase.** When someone accuses an fMRI researcher of engaging in reverse inference – drawing conclusions about what psychological process is involved given the activation of a brain area – it is usually a criticism. However, reverse inference is indeed one of the overarching goals of how cognitive and social neuroscience inform psychology in general. We want to know when we can make sound inferences about the psychological processes involved based on neuroimaging metrics in a given task or under certain conditions. Although this is a major goal of this endeavor, it is only one of them and is often a distal one. What people ought to be criticizing is premature, decontextualized, or otherwise incomplete reverse inference that overreaches on the conclusions drawn from these methods: Does amygdala activation really mean ‘threat’? Are there other processes involved that might explain it? Does that depend on the particular stimuli being used? Is it a single part of the amygdala or several in concert? Even if replicable, how confident are we that the paradigm being used is representative of the possibility space of reasonable paradigms that could have been used instead? We have acknowledged for a long time that there is almost never a one-to-one mapping of activity in a single brain region to a single psychological process. Tackling the issue of how then to meaningfully relate psychological processes to the brain is what many of us are working on right now.
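To make the worry about decontextualized reverse inference concrete, here is a toy Bayes-rule sketch. The numbers are purely hypothetical (they are not taken from any study, mine or anyone else's); the point is only that activation in a region that responds to many things is, by itself, weak evidence for any one psychological process.

```python
# Toy Bayes-rule sketch with made-up numbers: why amygdala activation alone
# does not license the inference "the participant is processing threat"
# when the region also responds in other states.

p_threat = 0.30               # assumed prior: proportion of trials that truly involve threat processing
p_act_given_threat = 0.80     # assumed sensitivity: P(amygdala response | threat)
p_act_given_other = 0.40      # assumed P(amygdala response | novelty, ambiguity, salience, ...)

# Total probability of observing amygdala activation on a trial.
p_act = p_act_given_threat * p_threat + p_act_given_other * (1 - p_threat)

# Posterior probability that threat processing is what produced the activation.
p_threat_given_act = p_act_given_threat * p_threat / p_act
print(f"P(threat | amygdala activation) = {p_threat_given_act:.2f}")  # about 0.46 with these numbers
```

With these made-up numbers, seeing the activation moves you from a 30% prior to less than a coin flip that threat processing is actually what happened, which is why the follow-up questions above matter so much.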

However, it’s hard to accomplish these goals while pushing the envelope of psychological theory simultaneously. The collective expectation that cognitive and social neuroscience experiments have an obligation to contribute to our understanding of complex psychological phenomena (and not vice versa) asks for conclusions that are often too premature to be definitive, much less revolutionary. Moreover, I feel strongly that the insistence that we frame the interpretation of our results in ways that placate this expectation has led many otherwise cautious researchers to take unwarranted liberties into the dark side of reverse inference. Ironically, this may have spurred many of the criticisms of overreach that cognitive and social neuroscience has drawn from others within the broader psychology community. Just as it took years for psychometricians to gain an understanding of how measurement and the scope of our inferences make up the scaffolding upon which we can build psychological theory, it is my hope that a more mature understanding of the intersection between neuroscience and psychology can offer analogous insights. However, given the overwhelming complexity of each of these domains, these efforts will take time.

I sometimes like to think of cognitive/social neuroscience as more of an applied field, like, say, psychology of law, where researchers use what we understand about cognition and behavior to inform how those processes are deployed in the legal system; in our case, it is how they are deployed in the brain. If we were talking to a psychology of law scholar, we would never say to them “I don’t care about the law; I just care about what it can tell us about psychology”, because the inferential arrow is not pointed in that direction. For many questions, I think psychology has more to offer neuroscience than the other way around. I am excited to be a part of this endeavor and to use my knowledge of both domains to try to build a stronger and more fruitful bridge between them.

At the end of the day, neuroscience is going to move forward whether psychologists want to come along or not. And just as they have in many ‘big data’ domains, engineers working in neuroscience have already started asking questions about psychological phenomena without psychologists’ input. It seems to me that psychologists should not only want representation at the neuroscience table but also recognize that psychology is needed for a comprehensive understanding of the brain; the engineers will not be able to figure it out without it. I see the work of researchers in my subfield as attempting to fill that chair to some degree. I hope others join us not only in appreciating the beauty of the brain, but also in recognizing the extent to which psychologists’ understanding of the mind and behavior is essential to understanding the very organ that gives rise to them, and the challenges that come with that.

* To be clear, I am not saying that brain imaging is better than psychophysiology for making inferences about psychological phenomena. On the contrary, psychophysiological measures are often clearer and less expensive than their brain imaging counterparts. However, if you care about inference about the neural systems themselves, psychophysiology often can’t say as much about that (with some exceptions, like pupillometry and locus coeruleus activity).

** Because we cannot directly measure most phenomena of interest, almost all psychological measures – including things as simple as reaction times – are technically reverse inferences. Moreover, reaction times involve engaging volitional actions in motor cortex via a cascade of spatiotemporal events in the rest of the brain that are critical for understanding and making appropriate responses for the task at hand. There are many cogs in the machine, and in psychology there is no such thing as a free inference.

Data analysis is thinking, data analysis is theorizing

There is a popular adage about academic writing: “Writing is thinking.” The idea is this: There is a simple view of writing as just an expressive process – the ideas already exist in your mind and you are just putting them on a page. That may in fact be true some of the time, like for day-to-day emails or texting friends or whatever. But “writing is thinking” reminds us that for most scholarly work, the process of writing is a process of continued deep engagement with the world of ideas. Sometimes before you start writing you think you had it all worked out in your head, or maybe in a proposal or outline you wrote. But when you sit down to write the actual thing, you just cannot make it make sense without doing more intellectual heavy lifting. It’s not because you forgot, it’s not because you “just” can’t find the right words. It’s because you’ve got more thinking to do, and nothing other than sitting down and trying to write is going to show that to you.*

Something that is every bit as true, but less discussed and appreciated, is that in the quantitative sciences, the same applies to working with data. Data analysis is thinking. The ideas you had in your head, or the general strategy you wrote into your grant application or IRB protocol, are not enough. If that is all you have done so far, you almost always still have more thinking to do.

This point is exemplified really well in the recent Many Analysts, One Data Set paper. Twenty-nine teams of data analysts were given the same scientific hypothesis to test and the same dataset to test it in. But no two teams ran the same analysis, resulting in 29 different answers. This variability was neither statistical noise nor human error. Rather, the differences in results were because of different reasoned decisions by experienced data analysts. As the authors write in the introduction:

In the scientific process, creativity is mostly associated with the generation of testable hypotheses and the development of suitable research designs. Data analysis, on the other hand, is sometimes seen as the mechanical, unimaginative process of revealing results from a research study. Despite methodologists’ remonstrations…, it is easy to overlook the fact that results may depend on the chosen analytic strategy, which itself is imbued with theory, assumptions, and choice points.

The very end of the quote drives home a second, crucial point. Data analysis is thinking, but it is something else too. Data analysis is theorizing. And it is theorizing no matter how much or how little the analyst is thinking about it.

Scientific theory is not your mental state. Scientific theory consists of statements about nature. When, say, you decide on a scheme for how to exclude outliers in a response-time task, that decision implies a theory of which observations result from processes that are irrelevant to what you are studying and therefore ignorable. When you decide on how to transform variables for a regression, that decision implies a theory of the functional form of the relationships between measured variables. These theories may be longstanding ones, well-developed and deeply studied in the literature. Or they may be ad hoc, one-off implications. Moreover, the content of the analyst’s thinking may be framed in theoretical terms (“hmmm let me think through what’s generating this distribution”), or it may be shallow and rote (“this is how my old advisor said to trim response times”). But the analyst is still making decisions** that imply something about something in nature – the decisions are “imbued with theory.” That’s why scientists can invoke substantive reasons to critique each other’s analyses without probing each other’s mental states. “That exclusion threshold is too low, it excludes valid trials” is an admissible argument, and you don’t have to posit what was in the analyst’s head when you make it.
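As a concrete (and entirely made-up) illustration, here is a small sketch in Python. The response-time data, the conditions, and the two exclusion rules are all hypothetical; the point is only that each rule encodes a different implicit claim about which observations reflect task-irrelevant processes, and the estimated effect changes with it.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200

# Hypothetical response-time data (in ms): two conditions plus some off-task lapses.
rt_a = rng.lognormal(mean=6.40, sigma=0.3, size=n)      # condition A
rt_b = rng.lognormal(mean=6.45, sigma=0.3, size=n)      # condition B
lapse_trials = rng.choice(n, size=10, replace=False)
rt_b[lapse_trials] += rng.uniform(2000, 4000, size=10)  # lapses happen to land in B

def mean_diff(a, b, rule):
    """Condition difference (B minus A) after applying an exclusion rule to each condition."""
    return rule(b).mean() - rule(a).mean()

cutoff_rule = lambda x: x[x < 2000]                          # theory: trials over 2 s are not task processing
sd_rule = lambda x: x[np.abs(x - x.mean()) < 3 * x.std()]    # theory: extreme-for-this-sample trials are noise

print("No exclusions:       ", round(rt_b.mean() - rt_a.mean(), 1), "ms")
print("Fixed 2000 ms cutoff:", round(mean_diff(rt_a, rt_b, cutoff_rule), 1), "ms")
print("3 SD rule:           ", round(mean_diff(rt_a, rt_b, sd_rule), 1), "ms")
```

Whichever of those numbers ends up in the paper, the choice of rule was doing theoretical work, whether or not the analyst framed it that way.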

So data analysis decisions imply statements in theory-space, and in order to think well about them we probably need to think in that space too. To test one theory of interest, the process of data analysis will unavoidably invoke other theories. This idea is not, in fact, new. It is a longstanding and well-accepted principle in philosophy of science called the Duhem-Quine thesis. We just need to recognize that data analysis is part of that web of theories.

This gives us an expanded framework to understand phenomena like p-hacking. The philosopher Imre Lakatos said that if you make a habit of blaming the auxiliaries when your results don’t support your main theory, you are in what he called a degenerative research programme. As you might guess from the name, Imre wasn’t a fan. When we p-hack – try different analysis specifications until we get one we like – we are trying and discarding different configurations of auxiliary theories until we find one that lets us draw a preferred conclusion. We are doing degenerative science, maybe without even realizing it.

On the flip side, this is why preregistration can be a deeply intellectually engaging and rewarding process.*** Because without the data whispering in your ear, “Try it this way and if you get an asterisk we can go home,” you have one less shortcut around thinking about your analysis. You can, of course, leave the thinking until later. You can do so with full awareness and transparency: “This is an exploratory study, and we plan to analyze the data interactively after it is collected.” Or you can fool yourself, and maybe others, if you write a vague or partial preregistration. But if you commit to planning your whole data analysis workflow in advance, you will have nothing but thinking and theorizing to guide you through it. Which, sooner or later, is what you’re going to have to be doing.

* Or, you can write it anyway and not make sense, which also has a parallel in data analysis.
** Or outsourcing them to the software developer who decided what defaults to put in place.
*** I initially dragged my heels on starting to preregister – I know, I know – but when I finally started doing it with my lab, we experienced this for ourselves, somewhat to my own surprise.

What if we talked about p-hacking the way we talk about experimenter effects?

Discussions about p-hacking sometimes go sideways. A hypothetical exchange might go like this:

READER: Those p-values are all hovering just below .05, I bet the authors p-hacked.

AUTHOR: I know that I did not p-hack, and I resent the accusation.

By comparison, consider how we talk about another form of potential bias: experimenter effects.

It is widely accepted that experimenters’ expectations, beliefs, or other characteristics can influence participants in behavioral experiments and medical trials. We also accept that this can happen without intent or even awareness on the part of the experimenter. Expectations about how participants receiving a treatment are supposed to differ from those receiving a placebo might show up in the experimenter’s behavior in subtle ways that could influence the participants.

We also don’t have a complete theory of experimenter effects that allows us to reliably measure every manifestation or predict with high confidence when they will and won’t occur. So instead, we consider them as an open possibility in a wide range of situations. As a result, it is also widely accepted that using procedural safeguards against experimenter effects is a best practice in most experiments where a human experimenter will interact with subjects.

Because of all these shared assumptions, discussions around experimenter effects are often much less heated. If you are presenting a study design at lab meeting, and someone says “you’ll need to keep your RAs blind to condition, here’s an idea how to do that…” that’s generally considered a helpful suggestion rather than an insinuation of planned malfeasance.

And even after a study is done, it is generally considered fair game to ask about blinding and other safeguards, and incorporate their presence or absence into an evaluation of a study. If a study lacks such safeguards, authors generally don’t say things like “I would never stoop so low as to try to influence my participants, how dare you!” Everybody, including authors, understands that experimenters don’t always know how they might be influencing subjects. And when safeguards are missing, readers typically treat it as a reason for doubt and uncertainty. We allow and even expect readers to calibrate that uncertainty judgment based on other assumptions or information, like how plausible the effect seems, how strong or weak any partial or incomplete safeguards were, etc.

For some reason though, when it comes to potential sources of bias in data analysis, we have not (yet) reached a place where we can talk about it in a similar way. This is despite the fact that it has a lot in common with experimenter effects.

It is certainly possible for somebody to deliberately and strategically p-hack, just like it’s possible for an experimenter to wink and nudge and say “are you sure you’re not feeling better?” or whatever. But bias in data analysis does not have to happen that way. Analysts do not have to have intention or even awareness in order to do things that capitalize on chance.

Consider, first of all, that almost every data analysis involves many decisions: what data to include or exclude, whether or how to transform it, a zillion possibilities in specifying the analysis (what particular variables to look at, what analyses to run on them, whether to use one- or two-tailed tests, what covariates to include, which main, interactive, simple, or contrast effect[s] to treat as critical tests of the hypothesis, etc.), and then decisions about what to report. We psychologists of all people know that you cannot un-know something. So once the analyst has seen anything about the data – distributions, scatterplots, preliminary or interim analyses, whatever else – all the subsequent decisions will be made by a person who has that knowledge. And after that point, it is simply impossible for anybody – including the analyst – to state with any confidence how those decisions might otherwise have been made without that knowledge. Which means that we have to treat seriously the possibility that the analyst made decisions that overfit the analyses to the data.
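To see how this can play out without anyone intending anything, here is a small simulation sketch (my own illustration, with made-up specifications, not an analysis from any real paper). Both groups are drawn from the same distribution, so every "significant" result is a false positive; the only thing the flexible analyst does differently is try a handful of individually defensible specifications and report the best-looking one.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def best_p(x, y):
    """Smallest p-value across a few common, individually reasonable analysis choices."""
    ps = []
    for trim in (None, 2.5):                        # no exclusions vs. exclude |z| > 2.5
        for transform in (lambda v: v, np.log1p):   # raw vs. log-transformed scores
            xi, yi = transform(x), transform(y)
            if trim is not None:
                xi = xi[np.abs(stats.zscore(xi)) < trim]
                yi = yi[np.abs(stats.zscore(yi)) < trim]
            ps.append(stats.ttest_ind(xi, yi, equal_var=False).pvalue)
    return min(ps)

n_sims, n = 5000, 40
strict = flexible = 0
for _ in range(n_sims):
    x = rng.gamma(2.0, 1.0, n)   # both groups come from the same distribution,
    y = rng.gamma(2.0, 1.0, n)   # so any "effect" detected here is a false positive
    strict += stats.ttest_ind(x, y, equal_var=False).pvalue < .05
    flexible += best_p(x, y) < .05

print(f"False positives, one prespecified test:       {strict / n_sims:.3f}")
print(f"False positives, best of four specifications: {flexible / n_sims:.3f}")
```

With specifications this mild the inflation is modest, but it is there, and nothing about any single choice would look unreasonable on its own.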

More subtly, as Gelman and Loken discuss in their “forking paths” paper, bias is not defined by a behavior (how many analyses did you run?), but by a set of counterfactuals (how many analyses could you have run?). So even if the objective history is that one and only one analysis was run, that is not a guarantee of no bias.

What all of this means is that when it comes to bias in data analysis, we are in very much a similar situation as with experimenter effects. It is virtually impossible to measure or observe it happening in a single instance, even by the person doing the data analysis. But what we can do is define a broad set of circumstances where we have to take it seriously as a possibility.

It would be great if we could collectively shift our conversations around this issue. I think that would involve changes from both critical readers and from authors.

Start by considering procedures, not behavior or outcomes. Were safeguards in place, and if so, how effective were they? For bias in data analysis, the most common safeguard is preregistration. The mere existence of a preregistration (as indicated by a badge or an OSF link in a manuscript) tells you very little though – many of them do not actually constrain bias. Sometimes that is even by design (for example, preregistering an exploratory study is a great way to prevent editors or reviewers from pressuring you to HARK later on). A preregistration is just a transparency step: you have to actually read it to find out what it does. In order for a preregistration to prevent analytic bias, it has to do two things. First, it has to have a decision inventory – that is, it has to identify all of the decisions about what data to collect/analyze, how to analyze it, and what to report. So ask yourself: is there a section on exclusions? Transformations? Does it say what the critical test is? Etc. (This will be easier to do in domains where you are familiar with the analytic workflow for the research area. It can also be aided by consulting templates. And if authors write and post analysis code as part of a preregistration, that can make things clear too.) Second, the preregistration has to have a plan for all of those decision points. To the extent that the inventory is complete and the plans are specific and were determined separate from the data, the preregistration can be an effective safeguard against bias.
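For readers who find it easier to see this in concrete form, here is one hypothetical way a decision inventory with attached plans might be written down. The field names and specifics are my own invention, not a standard template, and real plans are usually longer.

```python
# Hypothetical decision inventory for a simple two-condition study.
# Every decision point gets named, and every named decision gets a plan.
PREREG_PLAN = {
    "sampling":        {"target_n": 120,
                        "stopping_rule": "stop at n = 120; no interim analyses"},
    "exclusions":      {"participants": "failed both attention checks",
                        "trials": "response time < 200 ms or > 2000 ms"},
    "transformations": {"dv": "log-transform response times before averaging"},
    "model":           "2 (condition) x 2 (group) ANOVA on mean log RT",
    "critical_test":   "condition x group interaction, two-tailed, alpha = .05",
    "covariates":      "none",
    "reporting":       "report all measures, conditions, and exclusions",
}
```

The format is beside the point; what matters is that every decision named in the inventory has a plan attached to it, specified before the data can whisper anything.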

When safeguards are missing or incomplete, everyone – authors and readers alike – should treat analytic bias as a serious possibility. If there is no preregistration or other safeguards, then bias is possible. If there is a preregistration but it was vague or incomplete, bias is also possible. In a single instance it is often impossible to know what actually happened, for the reasons I discussed above. It can be reasonable to start looking at indirect stuff like statistical evidence (like the distribution of p-values), whether the result is a priori implausible, etc. Inferences about these things should be made with calibrated uncertainty. p-curves are neither perfect nor useless; improbable things really do happen, though by definition rarely; etc. So usually we should not be too sure in any direction.

Inferences about authors should be rare. We should have a low bar for talking about science and a high bar for talking about scientists. This cuts both ways. Casual talk challenging authors’ competence, intentions, unreported behaviors, etc. is often both hurtful and unjustified when we are talking about single papers.* But also, authors’ positive assertions about their character, behavior, etc. rarely shed light and can have the perverse effect of reinforcing the message that they, and not just the work, are a legitimate part of the conversation. As much as possible, make all the nouns in your discussion things like “the results,” “the procedure,” etc. and not “the authors” (or for that matter “my critics”). And whether you are an author, a critic, or even an observer, you can point out when people are talking about authors and redirect the conversation to the work.

I realize this last item draws a razor-thin line and maybe sometimes it is no line at all. After all, things like what safeguards were in place, and what happened if they weren’t, are results of the researcher’s behavior. So even valid criticism implicates what the authors did or didn’t do, and it will likely be personally uncomfortable for them. But it’s a distinction that’s worth observing as much as you can when you criticize work or respond to criticisms. And I would hope we’ve learned from the ways we talk about experimenter effects that it is possible to have less heated, and frankly more substantive, discussions about bias when we do that.

Finally, it is worth pointing out that preregistration and other safeguards are still really new to psychology and many other scientific fields. We are all still learning, collectively, how to do them well. That means that we need to be able to criticize them openly, publicly, and vigorously – if we do not talk about them, we cannot get better at doing them. But it also means that some preregistration is almost always better than none, because even a flawed or incomplete one will increase transparency and make it possible to criticize work more effectively. Even as we critique preregistrations that could have been done better, we should recognize that anybody who makes that critique and improvement possible has done something of value.

* In the bigger picture, for better or worse, science pins career advancement, resources, prestige, etc. to people’s reputations. So at some point we have to be able to talk about these things. This is a difficult topic and not something I want to get into here, other than to say that discussions about who is a good scientist are probably better left to entirely separate conversations from ones where we scientifically evaluate single papers, because the evidentiary standards and consequences are so different.

Accountable replications at Royal Society Open Science: A model for scientific publishing

Kintsugi pottery. Source: Wikimedia Commons.

Six years ago I floated an idea for scientific journals that I nicknamed the Pottery Barn Rule. The name is a reference to an apocryphal retail store policy captured in the phrase, “you break it, you bought it.” The idea is that if you pick something up in the store, you are responsible for what happens to it in your hands.* The gist of the idea in that blog post was as follows: “Once a journal has published a study, it becomes responsible for publishing direct replications of that study. Publication is subject to editorial review of technical merit but is not dependent on outcome.”

The Pottery Barn framing was somewhat lighthearted, but a more serious inspiration for it (though I don’t think I emphasized this much at the time) was newspaper correction policies. When news media take a strong stance on vetting reports of errors and correcting the ones they find, they are more credible in the long run. The good ones understand that taking a short-term hit when they mess up is part of that larger process.**

The core principle driving the Pottery Barn Rule is accountability. When a peer-reviewed journal publishes an empirical claim, its experts have judged the claim to be sound enough to tell the world about. A journal that adopts the Pottery Barn Rule is signaling that it stands behind that judgment. If a finding was important enough to publish the first time, then it is important enough to put under further scrutiny, and the journal takes responsibility to tell the world about those efforts too.

In the years since the blog post, a few journals have picked up the theme in their policies or practices. For example, the Journal of Research in Personality encourages replications and offers an abbreviated review process for studies it has published within the last 5 years. Psychological Science offers a preregistered direct replications submission track for replications of work they’ve published. And Scott Lilienfeld has announced that Clinical Psychological Science will follow the Pottery Barn Rule. In all three instances, these journals have led the field in signaling that they take responsibility for publishing replications.

And now, thanks to hard work by Chris Chambers,*** I was excited to read this morning that the journal Royal Society Open Science has announced a new replication policy that is the fullest implementation yet. Other journals should view the RSOS policy as a model for adoption. In the new policy, RSOS is committing to publishing any technically sound replication of any study it has published, regardless of the result, and providing a clear workflow for how it will handle such studies.

What makes the RSOS policy stand out? Accountability means tying your hands – you do not get to dodge it when it will sting or make you look bad. Under the RSOS policy, editors will still judge the technical faithfulness of replication studies. But they cannot avoid publishing replications on the basis of perceived importance or other subjective factors. Rather, whatever determination the journal originally made about those subjective questions at the time of the initial publication is applied to the replication. Making this a firm commitment, and having it spelled out in a transparent written policy, means that the scientific community knows where the journal stands and can easily see if the journal is sticking to its commitment. Making it a written policy (not just a practice) also means it is more likely to survive past the tenure of the current editors.

Such a policy should be a win both for the journal and for science. For RSOS – and for authors that publish there – articles will now have the additional credibility that comes from a journal saying it will stand by the decision. For science, this will contribute to a more complete and less biased scientific record.

Other journals should now follow suit. Just as readers would trust a news source more if it is transparent about corrections — and less if it leaves it to other media to fix its mistakes — readers should have more trust in journals that view replications of things they’ve published as their responsibility, rather than leaving them to other (often implicitly “lesser”) journals. Adopting the RSOS policy, or one like it, will be a way for journals to raise the credibility of the work that they publish while they make scientific publishing more rigorous and transparent.

* In reality, the actual Pottery Barn store will quietly write off the breakage as a loss and let you pretend it never happened. This is probably not a good model to emulate for science.

** One difference is that because newspapers report concrete facts, they work from a presumption that they got those things right, and they only publish corrections for errors. Whereas in science, uncertainty looms much larger in our epistemology. We draw conclusions from the accumulation of statistical evidence, so the results of all verification attempts have value regardless of outcome. But the common theme across both domains is being accountable for the things you have reported.

*** You may remember Chris from such films as registered reports and stop telling the world you can cure cancer because of seven mice.

The replication price-drop in social psychology

Why is the replication crisis centered on social psychology? In a recent post, Andrew Gelman offered a list of possible reasons. Although I don’t agree with every one of his answers (I don’t think data-sharing is common in social psych for example), it is an interesting list of ideas and an interesting question.

I want to riff on one of those answers, because it is something I’ve been musing about for a while. Gelman suggests that in social psychology, experiments are comparatively easy and cheap to replicate. Let’s stipulate that this is true of at least some parts of social psych. (Not necessarily all of them – I’ll come back to that.) What would easy and cheap replications do for a field? I’d suggest they have two, somewhat opposing effects.

On the one hand, running replications is the most straightforward way to obtain evidence about whether an effect is replicable.1 So the easier it is to run a replication, the easier it will be to discover if a result is a fluke. Broaden that out, and if a field has lots of replicability problems and replications are generally easy to run, it should be easier to diagnose the field.

But on the other hand, in a field or area where it is easy to run replications, that should facilitate a scientific ecosystem where unreplicable work can get weeded out. So over time, you might expect a field to settle into an equilibrium where by routinely running those easy and cheap replications, it is keeping unreplicable work at a comfortably low rate.2 No underlying replication problem, therefore no replication crisis.

The idea that I have been musing on for a while is that “replications are easy and cheap” is a relatively new development in social psychology, and I think that may have some interesting implications. I tweeted about it a while back but I thought I’d flesh it out.

Consider that until around a decade ago, almost all social psychology studies were run in person. You might be able to do a self-report questionnaire study in mass testing sessions, but a lot of experimental protocols could only be run a few subjects at a time. For example, any protocol that involved interaction or conditional logic (i.e., couldn’t just be printed on paper for subjects to read) required live RAs to run the show. A fair amount of measurement was analog and required data entry. And computerized presentation or assessment was rate-limited by the number of computers and cubicles a lab owned. All of this meant that even a lot of relatively simpler experiments required a nontrivial investment of labor and maybe money. And a lot of those costs were per-subject costs, so they did not scale up well.

All of this changed only fairly recently, with the explosion of internet experimentation. In the early days of the dotcom revolution you had to code websites yourself,3 but eventually companies like Qualtrics sprung up with pretty sophisticated and usable software for running interactive experiments. That meant that subjects could complete many kinds of experiments at home without any RA labor to run the study. And even for in-lab studies, a lot of data entry – which had been a labor-intensive part of running even a simple self-report study – was cut out. (Or if you were already using experiment software, you went from having to buy a site license for every subject-running computer to being able to run it on any device with a browser, even a tablet or phone.) And Mechanical Turk meant that you could recruit cheap subjects online in large numbers and they would be available virtually instantly.

All together, what this means is that for some kinds of experiments in some areas of psychology, replications have undergone a relatively recent and sizeable price drop. Some kinds of protocols pretty quickly went from something that might need a semester and a team of RAs to something you could set up and run in an afternoon.4 And since you weren’t booking up your finite lab space or spending a limited subject-pool allocation, the opportunity costs got lower too.

Notably, growth of all of the technologies that facilitated the price-drop accelerated right around the same time as the replication crisis was taking off. Bem, Stapel, and false-positive psychology were all in 2011. That’s the same year that Buhrmester et al published their guide to running experiments on Mechanical Turk, and just a year later Qualtrics got a big venture capital infusion and started expanding rapidly.

So my conjecture is that the sudden price drop helped shift social psychology out of a replications-are-rare equilibrium and moved it toward a new one. In pretty short order, experiments that previously would have been costly to replicate (in time, labor, money, or opportunity) got a lot cheaper. This meant that there was a gap between the two effects of cheap replications I described earlier: All of a sudden it was easy to detect flukes, but there was a buildup of unreplicable effects in the literature from the old equilibrium. This might explain why a lot of replications in the early twenty-teens were social priming5 studies and similar paradigms that lend themselves to online experimentation pretty well.

To be sure, I don’t think this could by any means be a complete theory. It’s more of a facilitating change along with other factors. Even if replications are easy and cheap, researchers still need to be motivated to go and run them. Social psychology had a pretty strong impetus to do that in 2011, with Bem, Stapel, and False-positive psychology all breaking in short order. And as researchers in social psychology started finding cause for concern in those newly-cheap studies, they were motivated to widen their scope to replicating other studies that had been designed, analyzed, and reported in similar ways but that hadn’t had so much of a price-drop.

To date that broadening-out from the easy and cheap studies hasn’t spread nearly as much to other subfields like clinical psychology or developmental psychology. Perhaps there is a bit of an ingroup/outgroup dynamic – it is easier to say “that’s their problem over there” than to recognize commonalities. And those fields don’t have a bunch of cheap-but-influential studies of their own to get them started internally.6

An optimistic spin on all this is that social psychology could be on its way to a new equilibrium where running replications becomes more of a normal thing. But there will need to be an accompanying culture shift where researchers get used to seeing replications as part of mainstream scientific work.

Another implication is that the price-drop and resulting shift in equilibrium has created a kind of natural experiment where the weeding-out process has lagged behind the field’s ability to run cheap replications. A boom in metascience research has taken advantage of this lag to generate insights into what does7 and doesn’t8 make published findings less likely to be replicated. Rather than saying “oh that’s those people over there,” fields and areas where experiments are difficult and expensive could and should be saying, wow, we could have a problem and not even know it – but we can learn some lessons from seeing how “those people over there” discovered they had a problem and what they learned about it.

  1. Hi, my name is Captain Obvious. 
  2. Conversely, it is possible that a field where replications are hard and expensive might reach an equilibrium where unreplicable findings could sit around uncorrected. 
  3. RIP the relevance of my perl skills. 
  4. Or let’s say a week + an afternoon if you factor in getting your IRB exemption. 
  5. Yeah, I said social priming without scare quotes. Come at me. 
  6. Though admirably, some researchers in those fields are now trying anyway, costs be damned. 
  7. Selective reporting of underpowered results
  8. Hidden moderators

Reflections on SIPS (guest post by Neil Lewis, Jr.)

The following is a guest post by Neil Lewis, Jr. Neil is an assistant professor at Cornell University.

Last week I visited the Center for Open Science in Charlottesville, Virginia to participate in the second annual meeting of the Society for the Improvement of Psychological Science (SIPS). It was my first time going to SIPS, and I didn’t really know what to expect. The structure was unlike any other conference I’ve been to—it had very little formal structure—there were a few talks and workshops here and there, but the vast majority of the time was devoted to “hackathons” and “unconference” sessions where people got together and worked on addressing pressing issues in the field: making journals more transparent, designing syllabi for research methods courses, forming a new journal, changing departmental/university culture to reward open science practices, making open science more diverse and inclusive, and much more. Participants were free to work on whatever issues we wanted to and to set our own goals, timelines, and strategies for achieving those goals.

I spent most of the first two days at the diversity and inclusion hackathon that Sanjay and I co-organized. These sessions blew me away. Maybe we’re a little cynical, but going into the conference we thought maybe two or three people would stop by and thus it would essentially be the two of us trying to figure out what to do to make open science more diverse and inclusive. Instead, we had almost 40 people come and spend the first day identifying barriers to diversity and inclusion, and developing tools to address those barriers. We had sub-teams working on (1) improving measurement of diversity statistics (hard to know how much of a diversity problem one has if there’s poor measurement), (2) figuring out methods to assist those who study hard-to-reach populations, (3) articulating the benefits of open science and resources to get started for those who are new, (4) leveraging social media for mentorship on open science practices, and (5) developing materials to help PIs and institutions more broadly recruit and retain traditionally underrepresented students/scholars. Although we’re not finished, each team made substantial headway in each of these areas.

On the second day, those teams continued working, but in addition we had a “re-hack” that allowed teams that were working on other topics (e.g., developing research methods syllabi, developing guidelines for reviewers, starting a new academic journal) to present their ideas and get feedback on how to make their projects/products more inclusive from the very beginning (rather than having diversity and inclusion be an afterthought as is often the case). Once again, it was inspiring to see how committed people were to making sure so many dimensions of our science become more inclusive.

These sessions, and so many others at the conference, gave me a lot of hope for the field—hope that I (and I suspect others) could really use (special shout-outs to Jessica Flake’s unconference on improving measurement, Daniel Lakens and Jeremy Biesanz’s workshop on sample size and effect size, and Liz Page-Gould and Alex Danvers’s workshop on Fundamentals of R for data analysis). It’s been a tough few years to be a scientist. I was working on my PhD in social psychology at the time that the Open Science Collaboration published its report estimating the reproducibility of psychological science to be somewhere between one-third and one-half. Then a similar report came out about the state of cancer research – only twenty-five percent of papers replicated there. Now it seems like at least once a month there is some new failed replication study, or some other study comes out with major methodological flaws. As someone just starting out, constantly seeing findings I learned were fundamental fail to replicate, and new work emerge so flawed, I often find myself wondering (a) what the hell do we actually know, and (b) if so many others can’t get it right, what chance do I have?

Many Big Challenges with No Easy Solutions

To try and minimize future fuck-ups in my own work, I started following a lot of methodologists on Twitter so that I could stay in the loop on what I need to do to get things right (or at least not horribly wrong). There are a lot of proposed solutions out there (and some argument about those solutions, e.g., p < .005) but there are some big ones that seem to have reached consensus, including vastly increasing the size of our samples to increase the reliability of findings. These solutions make sense for addressing the issues that got us to this point, but the more I’ve thought about and talked to others about them, the more it became clear that some may unintentionally create another problem along the way, which is to “crowd out” some research questions and researchers. For example, when talking with scholars who study hard-to-reach populations (e.g., racial and sexual minorities), a frequently voiced concern is that it is nearly impossible to recruit the sample sizes needed to meet new thresholds of evidence.
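To put rough numbers on what those thresholds imply, here is a quick back-of-the-envelope power calculation (my own illustration, using the statsmodels library, not anything from the talks I mention) for the simplest possible two-group comparison:

```python
# Required sample size per group for an independent-samples t-test,
# 80% power, two-tailed alpha = .05, at a few plausible effect sizes.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for d in (0.2, 0.3, 0.5):
    n_per_group = analysis.solve_power(effect_size=d, alpha=0.05, power=0.80,
                                       alternative="two-sided")
    print(f"d = {d}: about {n_per_group:.0f} participants per group")
```

For an effect of d = 0.2, that works out to roughly 400 participants per group for a simple two-cell design, and interactions require considerably more; samples like that are exactly what is out of reach when your population is hard to recruit.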

To provide an example from my own research, I went to graduate school intending to study racial-ethnic disparities in academic outcomes (particularly Black-White achievement gaps). In my first semester at the University of Michigan I asked my advisor to pay for a pre-screen of the department of psychology’s participant pool to see how many Black students I would have to work with if I pursued that line of research. There were 42 Black students in the pool that semester. Forty-two. Out of 1,157. If memory serves me well, that was actually one of the highest concentrations of Black students in the pool in my entire time there. Seeing that, I asked others who study racial minorities what they did. I learned that unless they had well-funded advisors that could afford to pay for their samples, many either shifted their research questions to topics that were more feasible to study, or they would spend their graduate careers collecting data for one or two studies. In my area, that latter approach was not practical for being employable—professional development courses taught us that search committees expect multiple publications in the flagship journals, and those flagship journals usually require multiple studies for publication.

Learning about those dynamics, I temporarily shifted my research away from racial disparities until I figured out how to feasibly study those topics. In the interim, I studied other topics where I could recruit enough people to do the multi-study papers that were expected. That is not to say I am uninterested in those other topics I studied (I very much am) but disparities were what interested me most. Now, some may read that and think ‘Neil, that’s so careerist of you! You should have pursued the questions you were most passionate about, regardless of how long it took!’ And on an idealistic level, I agree with those people. But on a practical level—I have to keep a roof over my head and eat. There was no safety net at home if I was unable to get a job at the end of the program. So I played it safe for a few years before going back to the central questions that brought me to academia in the first place.

That was my solution. Others left altogether. As one friend depressingly put it—“there’s no more room for people like us; unless we get lucky with the big grants that are harder and harder to get, we can’t ask our questions—not when power analyses now say we need hundreds per cell; we’ve been priced out of the market.” And they’re not entirely wrong. Some collaborators and I recently ran a survey experiment with Black American participants; it was a 20-minute survey with 500 Black Americans. That one study cost us $11,000. Oh, and it’s a study for a paper that requires multiple studies. The only reason we can do this project is because we have a senior faculty collaborator who has an endowed chair and hence deep research pockets.

So that is the state of affairs. The goal post keeps shifting, and it seems that those of us who already had difficulty asking our questions have to choose between pursuing the questions we’re interested in, and pursuing questions that are practical for keeping roofs over our heads (e.g., questions that can be answered for $0.50 per participant on MTurk). And for a long time this has been discouraging because it felt as though those who have been leading the charge on research reform did not care. An example that reinforces this sentiment is a quote that floated around Twitter just last week. A researcher giving a talk at a conference said “if you’re running experiments with low sample n, you’re wasting your time. Not enough money? That’s not my problem.”

That researcher is not wrong. For all the reasons methodologists have been writing about for the past few years (and really, past few decades), issues like small sample sizes do compromise the integrity of our findings. At the same time, I can’t help but wonder about what we lose when the discussion stops there, at “that’s not my problem.” He’s right—it’s not his personal problem. But it is our collective problem, I think. What questions are we missing out on when we squeeze out those who do not have the thousands or millions of dollars it takes to study some of these topics? That’s a question that sometimes keeps me up at night, particularly the nights after conversations with colleagues who have incredibly important questions that they’ll never pursue because of the constraints I just described.

A Chance to Make Things Better

Part of what was so encouraging about SIPS was that we not only began discussing these issues, but people immediately took them seriously and started working on strategies to address them—putting together resources on “small-n designs” for those who can’t recruit the big samples, to name just one example. I have never seen issues of diversity and inclusion taken so seriously anywhere, and I’ve been involved in quite a few diversity and inclusion initiatives (given the short length of my career). At SIPS, people were working tirelessly to make actionable progress on these issues. And again, it wasn’t a fringe group of women and minority scholars doing this work as is so often the case—we had one of the largest hackathons at the conference. I really wish more people were there to witness it—it was amazing, and energizing. It was the best of science—a group of committed individuals working incredibly hard to understand and address some of the most difficult questions that are still unanswered, and producing practical solutions to pressing social issues.

Now it is worth noting that I had some skepticism going into the conference. When I first learned about it I went back-and-forth on whether I should go; and even the week before the conference, I debated canceling the trip. I debated canceling because there was yet another episode of the “purely hypothetical scenario” that Will Gervais described in his recent blog post:

A purely hypothetical scenario, never happens [weekly+++]

Some of the characters from that scenario were people I knew would be attending the conference. I was so disgusted watching it unfold that I had no desire to interact with them the following week at the conference. My thought as I watched the discourse was: if it is just going to be a conference of the angry men from Twitter where people are patted on the back for their snark, using a structure from the tech industry (an industry not known for inclusion), then why bother attending? Apparently, I wasn’t alone in that thinking. At the diversity hackathon we discussed how several of us invited colleagues to come who declined because, due to their perceptions of who was going to be there and how those people often engage on social media, they did not feel it was worth their time.

I went despite my hesitation and am glad I did—it was the best conference I’ve ever attended. The attendees were not only warm and welcoming in real life, they also seemed to genuinely care about working together to improve our science, and to improve it in equitable and inclusive ways. They really wanted to hear what the issues are, and to work together to solve them.

If we regularly engage with each other (both online and face-to-face) in the ways that participants did at SIPS 2017, the sky is the limit for what we can accomplish together. The climate in that space for those few days provided the optimal conditions for scientific progress to occur. People were able to let their guards down, to acknowledge that what we’re trying to do is f*cking hard and that none of us know all the answers, to admit and embrace that we will probably mess up along the way, and that’s ok. As long as we know more and are doing better today than we knew and did yesterday, we’re doing ok – we just have to keep pushing forward.

That approach is something that I hope those who attended can take away, and figure out how to replicate in other contexts, across different mediums of communication (particularly online). I think it’s the best way to do, and to improve, our science.

I want to thank the organizers for all of the work they put into the conference. You have no idea how much being in that setting meant to me. I look forward to continuing to work together to improve our science, and hope others will join in this endeavor.