Data peeking is always wrong (except when you do it right)

Imagine that you have entered a charity drawing to win a free iPad. The charity organizer draws a ticket, and it’s your number. Hooray! But wait, someone else is cheering too. After a little investigation it turns out that due to a printing error, two different tickets had the same winning number. You don’t want to be a jerk and make the charity buy another iPad, and you can’t saw it in half. So you have to decide who gets the iPad.

Suppose that someone proposes to flip a coin to decide who gets the iPad. Sounds pretty fair, right?

But suppose that the other guy with a winning ticket — let’s call him Pete — instead proposes the following procedure. First the organizer will flip a coin. If Pete wins that flip, he gets the iPad. But if you win the flip, then the organizer will toss the coin 2 more times. If Pete wins best out of 3, he gets the iPad. If you win best out of 3, then the organizer will flip yet another 2 times. If Pete wins the best out of those (now) 5 flips, he gets the iPad. If not, keep going… Eventually, if Pete gets tired and gives up before he wins the iPad, you can have it.

Doesn’t sound so fair, does it?

The procedure I just described is not all that different from the research practice of data peeking. Data peeking goes something like this: you run some subjects, then do an analysis. If it comes out significant, you stop. If not, you run some more subjects and try again. What Peeky Pete’s iPad Procedure and data-peeking have in common is that you are starting with a process that includes randomness (coin flips, or the random error in subjects’ behavior) but then using a biased rule to stop the random process when it favors somebody’s outcome. Which means that the “randomness” is no longer random at all.

Statisticians have been studying the consequences of data-peeking for a long time (e.g., Armitage et al., 1969). But the practice has received new attention recently in psychology, in large part because of the Simmons et al. false-positive psychology paper that came out last year. Given this attention, it is fair to wonder (1) how common is data-peeking, and (2) how bad is it?

How common is data-peeking?

Anecdotally, a lot of people seem to think data peeking is common. Tal Yarkoni described data peeking as “a time-honored tradition in the social sciences.” Dave Nussbaum wrote that “Most people don’t realize that looking at the data before collecting all of it is much of a problem,” and he says that until recently he was one of those people. Speaking from my own anecdotal experience, ever since Simmons et al. came out I’ve had enough casual conversations with colleagues in social psychology to bring me around to thinking that data peeking is not rare. And in fact, I have talked to more than one fMRI researcher who considers data peeking not only acceptable but beneficial (more on that below).

More formally, when Leslie John and others surveyed academic research psychologists about questionable research practices, a majority (55%) outright admitted that they have “decid[ed] whether to collect more data after looking to see whether the results were significant.” John et al. use a variety of techniques to try to correct for underreporting; they estimate the real prevalence to be much higher. On the flip side, it is at least a little ambiguous whether some respondents might have interpreted “deciding whether to collect more data” to include running a new study, rather than adding new subjects to an existing one. But the bottom line is that data-peeking does not seem to be at all rare.

How bad is it?

You might be wondering, is all the fuss about data peeking just a bunch of rigid stats-nerd orthodoxy, or does it really matter? After all, statisticians sometimes get worked up about things that don’t make much difference in practice. If we’re talking about something that turns a 5% Type I error rate into 6%, is it really a big deal?

The short answer is yes, it’s a big deal. Once you start looking into the math behind data-peeking, it quickly becomes apparent that it has the potential to seriously distort results. Exactly how much depends on a lot of factors: how many cases you run before you take your first peek, how frequently you peek after that, how you decide when to keep running subjects and when to give up, etc. But a good and I think realistic illustration comes from some simulations that Tal Yarkoni posted a couple of years ago. In one instance, Tal simulated what would happen if you run 10 subjects and then start peeking every 5 subjects after that. He found that you would effectively double your Type I error rate by the time you hit 20 subjects. If you peek a little more intensively and run a few more subjects, it gets a lot worse. Under what I think are pretty realistic conditions for a lot of psychology and neuroscience experiments, you could easily end up reporting p<.05 when the true false-positive rate is closer to .20.
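If you want to see this for yourself, a simulation along these lines is easy to set up. Here is a bare-bones sketch in the spirit of Tal's simulations (it is my own toy version, not his code, and the peeking schedule and the cap on the sample size are just illustrative choices):

```python
# A toy simulation of data peeking under the null hypothesis (the true effect is zero).
# This is my own sketch, not Tal Yarkoni's code. The schedule follows the example in
# the text (first look at n = 10, then a peek every 5 subjects); the cap of 50 subjects
# is an arbitrary illustrative choice.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def peeking_study(first_look=10, step=5, max_n=50, alpha=0.05):
    """Run one simulated study with data peeking; return True if it ever reaches p < alpha."""
    data = rng.normal(0, 1, max_n)  # null is true: there is no effect to find
    n = first_look
    while n <= max_n:
        p = stats.ttest_1samp(data[:n], 0).pvalue
        if p < alpha:
            return True   # "significant" -- stop and write it up
        n += step         # not significant -- run 5 more subjects and peek again
    return False          # give up at max_n

n_sims = 10_000
false_positives = sum(peeking_study() for _ in range(n_sims))
print(f"False-positive rate with peeking: {false_positives / n_sims:.3f}")
# For comparison, a single test at a fixed n (no peeking) hovers around the nominal .05;
# the peeking version comes out far above that.
```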

And that’s a serious difference. Most researchers would never dream of looking at a p=.19 in their SPSS output and then blatantly writing p<.05 in a manuscript. But if you data-peek enough, that could easily end up being de facto what you are doing, even if you didn’t realize it. As Tal put it, “It’s not the kind of thing you just brush off as needless pedantry.”

So what to do?

These issues are only becoming more timely, given current concerns about replicability in psychology. So what to do about it?

The standard advice to individual researchers is: don’t data-peek. Do a power analysis, set an a priori sample size, and then don’t look at your data until you are done for good. This should totally be the norm in the vast majority of psychology studies.

And to add some transparency and accountability, one of Psychological Science’s proposed disclosure statements would require you to state clearly how you determined your sample size. If that happens, other journals might follow. If you believe that most researchers want to be honest and just don’t realize how bad data-peeking is, that’s a pretty good way to spread the word. People will learn fast once their papers start getting sent back with a request to run a replication (or rejected outright).

But is abstinence the only way to go? Some researchers make a cost-benefit case for data peeking. The argument goes as follows: With very expensive procedures (like fMRI), it is wasteful to design high-powered studies if that means you end up running more subjects than you need to determine if there is an effect. (As a sidenote, high-powered studies are actually quite important if you are interested in unbiased (or at least less biased) effect size estimation, but that’s a separate conversation; here I’m assuming you only care about significance.) And on the flip side, the argument goes, Type II errors are wasteful too — if you follow a strict no-data-peeking policy, you might run 20 subjects and get p=.11 and then have to set aside the study and start over from scratch.

Of course, it’s also wasteful to report effects that don’t exist. And compounding that, studies that use expensive procedures are also less likely to get directly replicated, which means that false positives are less likely to be found out.

So if you don’t think you can abstain, the next-best thing is to use protection. For those looking to protect their p I have two words: interim analysis. It turns out this is a big issue in the design of clinical trials. Sometimes that is for very similar expense reasons. And sometimes it is because of ethical and safety issues: often in clinical trials you need ongoing monitoring so that you can stop the trial just as soon as you can definitively say that the treatment makes things better (so you can give it to the people in the placebo condition) or worse (so you can call off the trial). So statisticians have worked out a whole bunch of ways of designing and analyzing studies so that you can run interim analyses while keeping your false-positive rate in check. (Because of their history, such designs are sometimes called sequential clinical trials, but that shouldn’t chase you off — the statistics don’t care if you’re doing anything clinical.) SAS has a whole procedure for analyzing them, PROC SEQDESIGN. And R users have lots of options. (I don’t know if these procedures have been worked into fMRI analysis packages, but if they haven’t, they really should be.)
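To make the logic of interim analysis concrete, here is a rough sketch of the simplest kind of correction: hold every look to a stricter, constant per-look threshold (in the spirit of a Pocock boundary), chosen so that the overall false-positive rate across all the looks stays near .05. This is only an illustration of the idea; for a real study you would want the dedicated sequential-design tools mentioned above, not my toy simulation.

```python
# Illustrating the idea behind interim analysis: search for a constant per-look
# threshold (Pocock-style) that keeps the OVERALL false-positive rate across all
# looks near .05. Same illustrative peeking schedule as the earlier sketch.
# A demonstration of the logic, not a substitute for real sequential-design
# software (e.g., SAS PROC SEQDESIGN or the R packages mentioned above).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
looks = list(range(10, 51, 5))   # peek at n = 10, 15, ..., 50
n_sims = 2_000

def overall_alpha(per_look_alpha):
    """Fraction of null simulations that reach significance at ANY of the looks."""
    hits = 0
    for _ in range(n_sims):
        data = rng.normal(0, 1, looks[-1])   # null is true
        if any(stats.ttest_1samp(data[:n], 0).pvalue < per_look_alpha for n in looks):
            hits += 1
    return hits / n_sims

for a in [0.05, 0.03, 0.02, 0.01, 0.005]:
    print(f"per-look alpha = {a:.3f}  ->  overall false-positive rate ~ {overall_alpha(a):.3f}")
# Testing every look at the uncorrected .05 inflates the overall rate far above .05;
# the printout shows roughly how strict the per-look threshold has to be, in this
# particular setup, to bring the overall rate back near the nominal level.
```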

Very little that I’m saying is new. These issues have been known for decades. And in fact, the 2 recommendations I listed above — determine sample size in advance or properly account for interim testing in your analyses — are the same ones Tal made. So I could have saved some blog space and just said “please go read Armitage et al. or Todd et al. or Yarkoni & Braver.” (UPDATE via a commenter: or Strube, 2006.) But blog space is cheap, and as far as I can tell word hasn’t gotten out very far. And with (now) readily available tools and growing concerns about replicability, it is time to put uncorrected data peeking to an end.

Psychological Science to publish direct replications (maybe)

Pretty big news. Psychological Science is seriously discussing 3 new reform initiatives. They are outlined in a letter being circulated by Eric Eich, editor of the journal, and they come from a working group that includes top people from APS and several other scientists who have been active in working for reforms.

After reading it through (which I encourage everybody to do), here are my initial takes on the 3 initiatives:

Initiative 1: Create tutorials on power, effect size, and confidence intervals. There’s plenty of stuff out there already, but if PSci creates a good new source and funnels authors to it, it could be a good thing.

Initiative 2: Disclosure statements about the research process (such as how sample size was determined, unreported measures, etc.). This could end up being a good thing, but it will be complicated. Simine Vazire, one of the working group members who is quoted in the proposal, puts it well:

We are essentially asking people to “incriminate” themselves — i.e., reveal information that, in the past, editors have treated as reasons not to publish a paper. If we want authors to be honest, I think they will want some explicit acknowledgement that some degree of messiness (e.g., a null result here and there) will be tolerated and perhaps even treated as evidence that the entire set of findings is even more plausible (a la [Gregory] Francis, [Uli] Schimmack, etc.).

I bet there would be low consensus about what kinds and amounts of messiness are okay, because no one is accustomed to seeing that kind of information on a large scale in other people’s studies. It is also the case that things that are problematic in one subfield may be more reasonable in another. And reviewers and editors who lack the time or local expertise to really judge messiness against merit may fall back on simplistic heuristics rather than thinking things through in a principled way. (Any psychologist who has ever tried to say anything about causation, however tentative and appropriately bounded, in data that was not from a randomized experiment probably knows what that feels like.)

Another basic issue is whether people will be uniformly honest in the disclosure statements. I’d like to believe so, but without a plan for real accountability I’m not sure. If some people can get away with fudging the truth, the honest ones will be at a disadvantage.

Initiative 3: A special submission track for direct replications, with 2 dedicated Associate Editors and a system of pre-registration and prior review of protocols to allow publication decisions to be decoupled from outcomes. A replication section at a journal? If you’ve read my blog before you might guess that I like that idea a lot.

The section would be dedicated to studies previously published in Psychological Science, so in that sense it is in the same spirit as the Pottery Barn Rule. The pre-registration component sounds interesting — by putting a substantial amount of review in place before data are collected, it helps avoid the problem of replications getting suppressed because people don’t like the outcomes.

I feel mixed about another aspect of the proposal, limiting replications to “qualified” scientists. There does need to be some vetting, but my hope is that they will set the bar reasonably low. “This paradigm requires special technical knowledge” can too easily be cover for “only people who share our biases are allowed to study this effect.” My preference would be for a pro-data, pro-transparency philosophy. Make it easy for lots of scientists to run and publish replication studies, and make sure the replication reports include information about the replicating researchers’ expertise and experience with the techniques, methods, etc. Then meta-analysts can code for the replicating lab’s expertise as a moderator variable, and actually test how much expertise matters.

My big-picture take. Retraction Watch just reported yesterday on a study showing that retractions, especially retractions due to misconduct, cause promising scientists to move to other fields and funding agencies to direct dollars elsewhere. Between alleged fraud cases like Stapel, Smeesters, and Sanna, and all the attention going to false-positive psychology and questionable research practices, psychology (and especially social psychology) is almost certainly at risk of a loss of talent and money.

Getting one of psychology’s top journals to make real reforms, with the institutional backing of APS, would go a long way to counteract those negative effects. A replication desk in particular would leapfrog psychology past what a lot of other scientific fields do. Huge credit goes to Eric Eich and everyone else at APS and the working group for trying to make real reforms happen. It stands a real chance of making our science better and improving our credibility.

William James contemplates getting out of a warm bed on a cold morning

This morning felt quite ignominious indeed, and naturally it reminded me of William James. From the Principles of Psychology, chapter 26, “Will”:

We know what it is to get out of bed on a freezing morning in a room without a fire, and how the very vital principle within us protests against the ordeal. Probably most persons have lain on certain mornings for an hour at a time unable to brace themselves to the resolve. We think how late we shall be, how the duties of the day will suffer; we say, “I must get up, this is ignominious,” etc.; but still the warm couch feels too delicious, the cold outside too cruel, and resolution faints away and postpones itself again and again just as it seemed on the verge of bursting the resistance and passing over into the decisive act. Now how do we ever get up under such circumstances? If I may generalize from my own experience, we more often than not get up without any struggle or decision at all. We suddenly find that we have got up. A fortunate lapse of consciousness occurs; we forget both the warmth and the cold; we fall into some revery connected with the day’s life, in the course of which the idea flashes across us, “Hollo! I must lie here no longer” – an idea which at that lucky instant awakens no contradictory or paralyzing suggestions, and consequently produces immediately its appropriate motor effects. It was our acute consciousness of both the warmth and the cold during the period of struggle, which paralyzed our activity then and kept our idea of rising in the condition of wish and not of will. The moment these inhibitory ideas ceased, the original idea exerted its effects.

James’s visible presence in contemporary psychology seems mostly limited to 2 roles. Someone finds a quote, puts it as an epigraph at the top of their manuscript, and claims that their ideas have roots going more than a century back. Or alternatively someone finds a quote, puts it up as an epigraph, and then claims that their ideas overturn more than a century of received wisdom.

But going back and actually re-reading James seriously is usually an enlightening activity. Just from that chapter on “Will” you can draw lines to contemporary research on delay of gratification, self-regulatory depletion, goal pursuit, the relationship between attention and executive control, and automaticity. James’s ideas about all of these topics are nuanced, with a lot of connections but few easy one-to-one mappings (whether supported or falsified) to contemporary research. Every once in a while I get the urge to go back and look at something James wrote, and if no contradictory or paralyzing suggestions stop me, I’m always glad that I did.

Norms for the Big Five Inventory and other personality measures

Every once in a while I get emails asking me about norms for the Big Five Inventory. I got one the other day, and I figured that if more than one person has asked about it, it’s probably worth a blog post.

There’s a way of thinking about norms — which I suspect is the most common way of thinking about norms — that treats them as some sort of absolute interpretive framework. The idea is that you could tell somebody, hey, if you got this score on the Agreeableness scale, it means you have this amount of agreeableness.

But I generally think that’s not the right way of thinking about it. Lew Goldberg put it this way:

One should be very wary of using canned “norms” because it isn’t obvious that one could ever find a population of which one’s present sample is a representative subset. Most “norms” are misleading, and therefore they should not be used.

That is because “norms” are always calculated in reference to some particular sample, drawn from some particular population (which BTW is pretty much never “the population of all human beings”). Norms are most emphatically NOT an absolute interpretation — they are unavoidably comparative.

So the problem arises because the usual way people talk about norms tends to bury that fact. People say, oh, you scored at the 70th percentile. They don’t go on to say the 70th percentile of what. For published scales that give normed scores, it often turns out to mean the 70th percentile of the distribution of people who somehow made it into the scale author’s convenience sample 20 years ago.

So what should you do to help people interpret their scores? Lew’s advice is to use the sample you have at hand to construct local norms. For example, if you’re giving feedback to students in a class, tell them their percentile relative to the class.

Another approach is to use distributional information from an existing dataset and just be explicit about what comparison you are making and where the data come from. For the BFI, I sometimes refer people to a large dataset of adult American Internet users that I used for a paper. Sample descriptives are in the paper, and we’ve put up a table of means and SDs broken down by age and gender for people who want to make those finer distinctions. You can then use those means and SDs to convert your raw scores into z-scores, and then calculate or look up the normal-distribution percentile. You would then say something like, “This is where you stand relative to a bunch of Internet users who took this questionnaire online.” (You don’t have to use that dataset, of course. Think about what would be an appropriate comparison group and then poke around Google Scholar looking for a paper that reports descriptive statistics for the kind of sample you want.)
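For concreteness, the arithmetic is only a couple of lines. The comparison mean and SD in the sketch below are made-up placeholders, not actual BFI norms; substitute the values for whatever comparison group you settle on.

```python
# Convert a raw scale score to a z-score and a normal-distribution percentile
# relative to a chosen comparison sample. The numbers below are placeholders,
# not actual BFI norms; plug in the mean and SD for your comparison group.
from scipy.stats import norm

raw_score = 3.8          # a respondent's raw Agreeableness score (hypothetical)
comparison_mean = 3.6    # mean of the comparison sample (placeholder)
comparison_sd = 0.7      # SD of the comparison sample (placeholder)

z = (raw_score - comparison_mean) / comparison_sd
percentile = norm.cdf(z) * 100
print(f"z = {z:.2f}, which puts you at roughly the {percentile:.0f}th percentile "
      f"of the comparison sample")
```

The same two lines, run against a different comparison group’s mean and SD, are all it takes to produce shifts like the astronaut example below.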

Either the “local norms” approach or the “comparison sample” approach can work for many situations, though local norms may be difficult for very small samples. If the sample as a whole is unusual in some way, the local norms will remove the average “unusualness” whereas the comparison-sample approach will keep it in there, and you can decide which is the more useful comparison. (For example, an astronaut who scores in the 50th percentile of conscientiousness relative to other astronauts would be around the 93rd percentile relative to college undergrads.) But the most important thing is to avoid anything that sounds absolute. Be consistent and clear about the fact that you are making comparisons and about who you are comparing somebody to.

What counts as a successful or failed replication?

Let’s say that some theory states that people in psychological state A1 will engage in behavior B more than people in psychological state A2. Suppose that, a priori, the theory allows us to make this directional prediction, but not a prediction about the size of the effect.

A researcher designs an experiment — call this Study 1 — in which she manipulates A1 versus A2 and then measures B. Consistent with the theory, the result of Study 1 shows more of behavior B in condition A1 than A2. The effect size is d=0.8 (a large effect). A null hypothesis significance test shows that the effect is significantly different from zero, p<.05.

Now Researcher #2 comes along and conducts Study 2. The procedures of Study 2 copy Study 1 as closely as possible — the same manipulation of A, the same measure of B, etc. The result of Study 2 shows more of behavior B in condition A1 than in A2 — same direction as Study 1. In Study 2, the effect size is d=0.3 (a smallish effect). A null hypothesis significance test shows that the effect is significantly different from zero, p<.05. But a comparison of the Study 1 effect to the Study 2 effect (d=0.8 versus d=0.3) is also significant, p<.05.
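(As an aside, there is more than one way to make that last comparison between the two effect sizes. One common approach, sketched below with made-up sample sizes since the hypothetical doesn’t specify any, is to give each d an approximate standard error and test the difference with a z statistic.)

```python
# One common way to compare two independent standardized effects: approximate the
# standard error of each Cohen's d and test the difference with a z statistic.
# The per-condition sample sizes are invented for illustration; the hypothetical
# scenario above does not specify them.
import math
from scipy.stats import norm

def se_d(d, n1, n2):
    """Approximate standard error of Cohen's d for two independent groups."""
    return math.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))

d1, n1a, n1b = 0.8, 50, 50     # Study 1: d = 0.8 (hypothetical group sizes)
d2, n2a, n2b = 0.3, 300, 300   # Study 2: d = 0.3 (hypothetical group sizes)

z = (d1 - d2) / math.sqrt(se_d(d1, n1a, n1b)**2 + se_d(d2, n2a, n2b)**2)
p = 2 * (1 - norm.cdf(abs(z)))
print(f"Difference between effects: z = {z:.2f}, two-tailed p = {p:.3f}")
# With these particular (made-up) n's, each study is individually significant and
# the difference between the two d's also comes out below .05, matching the scenario.
```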

Here’s the question: did Study 2 successfully replicate Study 1?

My answer is no. Here’s why. When we say “replication,” we should be talking about whether we can reproduce a result. A statistical comparison of Studies 1 and 2 shows that they gave us significantly different results. We should be bothered by the difference, and we should be trying to figure out why.

People who would call Study 2 a “successful” replication of Study 1 are focused on what it means for the theory. The theoretical statement that inspired the first study only spoke about direction, and both results came out in the same direction. By that standard you could say that it replicated.

But I have two problems with defining replication in that way. My first problem is that, after learning the results of Study 1, we had grounds to refine the theory to include statements about the likely range of the effect’s size, not just its direction. Those refinements might be provisional, and they might be contingent on particular conditions (i.e., the experimental conditions under which Study 1 was conducted), but we can and should still make them. So Study 2 should have had a different hypothesis, a more focused one, than Study 1. Theories should be living things, changing every time they encounter new data. If we define replication as testing the theory twice then there can be no replication, because the theory is always changing.

My second problem is that we should always be putting theoretical statements to multiple tests. That should be such normal behavior in science that we shouldn’t dilute the term “replication” by including every possible way of doing it. As Michael Shermer once wrote, “Proof is derived through a convergence of evidence from numerous lines of inquiry — multiple, independent inductions all of which point to an unmistakable conclusion.” We should all be working toward that goal all the time.

This distinction — between empirical results vs. conclusions about theories — goes to the heart of the discussion about direct and conceptual replication. Direct replication means that you reproduce, as faithfully as possible, the procedures and conditions of the original study. So the focus should rightly be on the result. If you get a different result, it either means that despite your best efforts something important differed between the two studies, or that one of the results was an accident.

By contrast, when people say “conceptual replication” they mean that they have deliberately changed one or more major parts of the study — like different methods, different populations, etc. Theories are abstractions, and in a “conceptual replication” you are testing whether the abstract theoretical statement (in this case, B|A1 > B|A2) is still true under a novel concrete realization of the theory. That is important scientific work, but it differs in huge, qualitative ways from true replication. As I’ve said, it’s not just a difference in empirical procedures; it’s a difference in what kind of inferences you are trying to draw (inferences about a result vs. inferences about a theoretical statement). Describing those simply as 2 varieties of the same thing (2 kinds of replication) blurs this important distinction.

I think this means a few important things for how we think about replications:

1. When judging a replication study, the correct comparison is between the original result and the new one. Even if the original study ran a significance test against a null hypothesis of zero effect, that isn’t the test that matters for the replication. There are probably many ways of making this comparison, but within the NHST framework that is familiar to most psychologists, the proper “null hypothesis” to test against is the one that states that the two studies produced the same result.

2. When we observe a difference between a replication and an original study, we should treat that difference as a problem to be solved. Not (yet) as a conclusive statement about the validity of either study. Study 2 didn’t “fail to replicate” Study 1; rather, Studies 1 and 2 produced different results when they should have produced the same, and we now need to figure out what caused that difference.

3. “Conceptual replication” should depend on a foundation of true (“direct”) replicability, not substitute for it. The logic for this is very much like how validity is strengthened by reliability. It doesn’t inspire much confidence in a theory to say that it is supported by multiple lines of evidence if all of those lines, on their own, give results of poor or unknown consistency.

Paul Meehl on replication and significance testing

Still very relevant today.

A scientific study amounts essentially to a “recipe,” telling how to prepare the same kind of cake the recipe writer did. If other competent cooks can’t bake the same kind of cake following the recipe, then there is something wrong with the recipe as described by the first cook. If they can, then, the recipe is all right, and has probative value for the theory. It is hard to avoid the thrust of the claim: If I describe my study so that you can replicate my results, and enough of you do so, it doesn’t matter whether any of us did a significance test; whereas if I describe my study in such a way that the rest of you cannot duplicate my results, others will not believe me, or use my findings to corroborate or refute a theory, even if I did reach statistical significance. So if my work is replicable, the significance test is unnecessary; if my work is not replicable, the significance test is useless. I have never heard a satisfactory reply to that powerful argument.

Meehl, P. E. (1990). Appraising and amending theories: The strategy of Lakatosian defense and two principles that warrant using it. Psychological Inquiry, 1, 108-141, 173-180.

A Pottery Barn rule for scientific journals

Proposed: Once a journal has published a study, it becomes responsible for publishing direct replications of that study. Publication is subject to editorial review of technical merit but is not dependent on outcome. Replications shall be published as brief reports in an online supplement, linked from the electronic version of the original.

*****

I wrote about this idea a year ago when JPSP refused to publish a paper that failed to replicate one of Daryl Bem’s notorious ESP studies. I discovered, immediately after writing up the blog post, that other people were thinking along similar lines. Since then I have heard versions of the idea come up here and there. And strands of it came up again in David Funder’s post on replication (“[replication] studies should, ideally, be published in the same journal that promulgated the original, misleading conclusion”) and the comments to it. When a lot of people are coming up with similar solutions to a problem, that’s probably a sign of something.

Like a lot of people, I believe that the key to improving our science is through incentives. You can finger-wag about the importance of replication all you want, but if there is nowhere to publish and no benefit for trying, you are not going to change behavior. To a large extent, the incentives for individual researchers are controlled through institutions — established journal publishers, professional societies, granting agencies, etc. So if you want to change researchers’ behavior, target those institutions.

Hence a Pottery Barn rule for journals: once you publish a study, you own its replicability (or at least a significant piece of it).

This would change the incentive structure for researchers and for journals in a few different ways. For researchers, there are currently insufficient incentives to run replications. This would give them a virtually guaranteed outlet for publishing a replication attempt. Such publications should be clearly marked on people’s CVs as brief replication reports (probably by giving the online supplement its own journal name, e.g., Journal of Personality and Social Psychology: Replication Reports). That would make it easier for the academic marketplace (like hiring and promotion committees, etc.) to reach its own valuation of such work.

I would expect that grad students would be big users of this opportunity. Others have proposed that running replications should be a standard part of graduate training (e.g., see Matt Lieberman’s idea). This would make it worth students’ while, but without the organizational overhead of Matt’s proposal. The best 1-2 combo, for grad students and PIs alike, would be to embed a direct replication in a replicate-and-extend study. Then if the “extend” part does not work out, the replication report is a fallback (hopefully with a footnote about the failed extend). And if it does, the new paper is a more cumulative contribution than the shot-in-the-dark papers we often see now.

A system like this would change the incentive structure for original studies too. Researchers would know that whatever they publish is eventually going to be linked to a list of replication attempts and their outcomes. As David pointed out, knowing that others will try to replicate your work — and in this proposal, knowing that reports of those attempts would be linked from your own paper! — would undermine the incentives to use questionable research practices far better than any heavy-handed regulatory response. (And if that list of replication attempts is empty 5 years down the road because nobody thinks it’s worth their while to replicate your stuff? That might say something too.)

What about the changed incentives for journals? One benefit would be that the increased accountability for individual researchers should lead to better quality submissions for journals that adopted this policy. That should be a big plus.

A Pottery Barn policy would also increase accountability for journals. It would become much easier to document a journal’s track record of replicability, which could become a counterweight to the relentless pursuit of impact factors. Such accountability would mean a greater emphasis on evaluating replicability during the review process — e.g., to consider statistical power, to let reviewers look at the raw data and the materials and stimuli, etc.

But sequestering replication reports into an online supplement means that the journal’s main mission can stay intact. So if a journal wants to continue to focus on groundbreaking first reports in its main section, it can continue to do so without fearing that its brand will be diluted (though I predict that it would have to accept a lower replication rate in exchange for its focus on novelty).

Replication reports would generate some editorial overhead, but not nearly as much as original reports. They could be published based directly on an editorial decision, or perhaps with a single peer reviewer. A structured reporting format like the one used at Psych File Drawer would make it easier to evaluate the replication study relative to the original. (I would add a field to describe the researchers’ technical expertise and experience with the methods, since that is a potential factor in explaining differences in results.)

Of course, journals would need an incentive to adopt the Pottery Barn rule in the first place. Competition from outlets like PLoS One (which does not consider importance/novelty in its review criteria) or Psych File Drawer (which only publishes replications) might push the traditional journals in this direction. But ultimately it is up to us scientists. If we cite replication studies, if we demand and use outlets that publish them, and if we speak loudly enough — individually or through our professional organizations — I think the publishers will listen.

Replication, period. (A guest post by David Funder)

The following is a guest post by David Funder. David shares some of his thoughts about the best way forward through social psychology’s recent controversies over fraud and corner-cutting. David is a highly accomplished researcher with a lot of experience in the trenches of psychological science. He is also President-Elect of the Society for Personality and Social Psychology (SPSP), the main organization representing academic social psychologists — but he emphasizes that he is not writing on behalf of SPSP or its officers, and the views expressed in this essay are his own.

*****

Can we believe everything (or anything) that social psychological research tells us? Suddenly, the answer to this question seems to be in doubt. The past few months have seen a shocking series of cases of fraud — researchers literally making their data up — by prominent psychologists at prestigious universities. These revelations have catalyzed an increase in concern about a much broader issue, the replicability of results reported by social psychologists. Numerous writers are questioning common research practices such as selectively reporting only studies that “work” and ignoring relevant negative findings that arise over the course of what is euphemistically called “pre-testing,” increasing N’s or deleting subjects from data sets until the desired findings are obtained, and, perhaps worst of all, being inhospitable or even hostile to replication research that could, in principle, cure all these ills.

Reaction is visible. The European Association of Personality Psychology recently held a special three-day meeting on the topic, to result in a set of published recommendations for improved research practice; a well-financed conference in Santa Barbara in October will address the “decline effect” (the mysterious tendency of research findings to fade away over time); and the President of the Society for Personality and Social Psychology was recently motivated to post a message to the membership expressing official concern. These are just three reactions that I personally happen to be familiar with; I’ve also heard that other scientific organizations and even agencies of the federal government are looking into this issue, one way or another.

This burst of concern and activity might seem to be unjustified. After all, literally making your data up is a far cry from practices such as pre-testing, selective reporting, or running multiple statistical tests. These practices are even, in many cases, useful and legitimate. So why did they suddenly come under the microscope as a result of cases of data fraud? The common thread seems to be the issue of replication. As I already mentioned, the idealistic model of healthy scientific practice is that replication is a cure for all ills. Conclusions based on fraudulent data will fail to be replicated by independent investigators, and so eventually the truth will out. And, less dramatically, conclusions based on selectively reported data or derived from other forms of quasi-cheating, such as “p-hacking,” will also fade away over time.

The problem is that, in the cases of data fraud, this model visibly and spectacularly failed. The examples that were exposed so dramatically — and led tenured professors to resign from otherwise secure and comfortable positions (note: this NEVER happens except under the most extreme circumstances) — did not come to light because of replication studies. Indeed, anecdotally — which, sadly, seems to be the only way anybody ever hears of replication studies — various researchers had noticed that they weren’t able to repeat the findings that later turned out to be fraudulent, and one of the fakers even had a reputation of generating data that were “too good to be true.” But that’s not what brought them down. Faking of data was only revealed when research collaborators with first-hand knowledge — sometimes students — reported what was going on.

This fact has to make anyone wonder: what other cases are out there? If literal faking of data is only detected when someone you work with gets upset enough to report you, then most faking will never be detected. Just about everybody I know — including the most pessimistic critics of social psychology — believes, or perhaps hopes, that such outright fraud is very rare. But grant that point and the deeper moral of the story still remains: False findings can remain unchallenged in the literature indefinitely.

Here is the bridge to the wider issue of data practices that are not outright fraudulent, but increase the risk of misleading findings making it into the literature. I will repeat: so-called “questionable” data practices are not always wrong (they just need to be questioned). For example, explorations of large, complex (and expensive) data sets deserve and even require multiple analyses to address many different questions, and interesting findings that emerge should be reported. Internal safeguards are possible, such as split-half replications or randomization analyses to assess the probability of capitalizing on chance. But the ultimate safeguard to prevent misleading findings from permanent residence in (what we think is) our corpus of psychological knowledge is independent replication. Until then, you never really know.

Many remedies are being proposed to cure the ills, or alleged ills, of modern social psychology. These include new standards for research practice (e.g., registering hypotheses in advance of data gathering), new ethical safeguards (e.g., requiring collaborators on a study to attest that they have actually seen the data), new rules for making data publicly available, and so forth. All of these proposals are well-intentioned but the specifics of their implementation are debatable, and ultimately raise the specter of over-regulation. Anybody with a grant knows about the reams of paperwork one now must mindlessly sign attesting to everything from the exact percentage of their time each graduate student has spent on your project to the status of your lab as a drug-free workplace. And that’s not even to mention the number of rules — real and imagined — enforced by the typical campus IRB to “protect” subjects from the possible harm they might suffer from filling out a few questionnaires. Are we going to add yet another layer of rules and regulations to the average over-worked, under-funded, and (pre-tenure) insecure researcher? Over-regulation always starts out well-intentioned, but can ultimately do more harm than good.

The real cure-all is replication. The best thing about replication is that it does not rely on researchers doing less (e.g., running fewer statistical tests, only examining pre-registered hypotheses, etc.), but it depends on them doing more. It is sometimes said the best remedy for false speech is more speech. In the same spirit, the best remedy for misleading research is more research.

But this research needs to be able to see the light of day. Current journal practices, especially among our most prestigious journals, discourage and sometimes even prohibit replication studies from publication. Tenure committees value novel research over solid research. Funding agencies are always looking for the next new thing — they are bored with the “same old same old” and give low priority to research that seeks to build on existing findings — much less seeks to replicate them. Even the researchers who find failures to replicate often undervalue them. I must have done something wrong, most conclude, stashing the study into the proverbial “file drawer” as an unpublishable, expensive and sad waste of time. Those researchers who do become convinced that, in fact, an accepted finding is wrong, are unlikely to attempt to publish this conclusion. Instead, the failure becomes fodder for late-night conversations, fueled by beverages at hotel bars during scientific conferences. There, and pretty much only there, can you find out which famous findings are the ones that “everybody knows” can’t be replicated.

I am not arguing that every replication study must be published. Editors have to use their judgment. Pages really are limited (though less so in the arriving age of electronic publishing) and, more importantly, editors have a responsibility to direct the limited attentional resources of the research community to articles that matter. So any replication study should be carefully evaluated for the skill with which it was conducted, the appropriate level of statistical power, and the overall importance of the conclusion. For example, a solid set of high-powered studies showing that a widely accepted and consequential conclusion was dead wrong would be important in my book. (So would a series of studies confirming that an important, surprising, and counter-intuitive finding was actually true. But most aren’t, I suspect.) And this series of studies should, ideally, be published in the same journal that promulgated the original, misleading conclusion. As your mother always said, clean up your own mess.

Other writers have recently laid out interesting, ambitious, and complex plans for reforming psychological research, and even have offered visions of a “research utopia.” I am not doing that here. I only seek to convince you of one point: psychology (and probably all of science) needs more replications. Simply not ruling replication studies as inadmissible out-of-hand would be an encouraging start. Do I ask too much?

Are social psychologists biased against conservatives, or do they just think they are?

A new paper coming out next month by Yoel Inbar and Joris Lammers proposes that some social psychologists discriminate against conservatives in hiring and other professional decisions. Inside Higher Ed has the scoop:

Numerous surveys have found that professors, especially those in some disciplines, are to the left of the general public. But those same — and other — surveys have rarely found evidence that left-leaning academics discriminate on the basis of politics…

A new study, however, challenges that assumption — at least in the field of social psychology… Just over 37 percent of [social psychologists] surveyed said that, given equally qualified candidates for a job, they would support the hiring of a liberal candidate over a conservative candidate. Smaller percentages agreed that a “conservative perspective” would negatively influence their odds of supporting a paper for inclusion in a journal or a proposal for a grant.

Here’s an interesting thing though… social psychology as a field of research is heavily involved in studying implicit biases. And there is a long tradition in social psych of studies showing that people do not have access to the psychological processes that produce these biases and cannot even recognize that they have biases.

Here’s an example of the kind of question used as evidence of bias:

For the next set of questions, we are interested in what you think YOU WOULD DO in specific situations.

1. If you were reviewing a research grant application that seemed to you to take a politically conservative perspective, do you think this would negatively influence your decision on the grant application?

[Other questions dealt with reviewing papers, hiring, etc. Respondents were given a 7-point scale, and the authors categorized any response at or above the midpoint — labeled “somewhat” — as indicating a willingness to discriminate.]

How is a good social psychologist supposed to answer this question? If you believe in the IAT, Wilson & Nisbett, etc. and you are committed to trying to give the most accurate answer that you can, then I think one very defensible conclusion to derive from those theories is that yes, you are at least somewhat likely to discriminate. And because your training would tell you to be skeptical of your intuitions and introspections, you could reach that conclusion even if you fervently believe you would never intentionally discriminate, and even if your past behavior has always appeared (to you) to be completely fair.

Is that what some of the respondents were doing — giving expert predictions rather than personal responses? I have no idea. But I find it hard to rule out. And I think it’s enough of a possibility to raise serious concerns about labeling responses to that question as “willingness to discriminate.” Many social psychologists who study implicit bias believe that a lot of discriminatory behavior happens apart from, or even in opposition to, what people are “willing” (intending) to do. And the survey question doesn’t ask about willingness, it asks about probable behavior (“what you think YOU WOULD DO,” caps in the original). To a layperson, that distinction might seem like hairsplitting. To a social psychologist, the difference is huge.

None of this is to knock the survey itself. I think Inbar and Lammers have given us a useful window into what social psychologists believe about political bias in their field (and as an aside, there’s lots of other interesting stuff in that paper besides the discrimination questions). But I’m not convinced that one can make a straightforward leap to inferring discriminatory behavior from this survey. Like a lot of research, the study raises more questions than it answers, and begs for followup studies with behavioral outcomes. My personal hunch is that it’s plausible that social psychologists’ political beliefs do influence their professional decisions. But if I put my scientist glasses on and evaluate this survey as a piece of empirical research, I’m just not sure that it really pins down a clear answer yet.

*****

Inbar, Y., & Lammers, J. (in press). Political diversity in social and personality psychology. Perspectives on Psychological Science. Working paper available here.

From Walter Stewart to Uri Simonsohn

Over on G+, Ole Rogeberg asks what ever happened to Walter Stewart? Stewart was a biologist employed by NIH in the 80s and 90s who became involved in rooting out questionable research practices.

Rogeberg posts an old Omni Magazine interview with Stewart (from 1989) in which Stewart describes how he got involved in investigating fraud and misconduct and what led him to think that it was more widespread than many scientists were willing to acknowledge. If you have been following the fraud scandals in psychology and the work of Uri Simonsohn, you should read it. It is completely riveting. And I found some of the parallels to be uncanny.

For example, on Stewart’s first investigation of questionable research, one of the clues that raised his suspicions was a pattern of too-similar means in a researcher’s observations. Similar problems — estimates closer together than what would be expected by chance — led Simonsohn to finger 2 researchers for misconduct.
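The basic logic behind a too-similar-means check is simple enough to sketch, though what follows is only a cartoon of it, not Simonsohn’s actual procedure (that is laid out in the “Just Post It” paper): treat the reported condition means as draws from their sampling distributions and ask how often ordinary sampling error would pack them together that tightly. All of the numbers below are invented.

```python
# A cartoon of the "too-similar means" logic, not Simonsohn's actual analysis.
# Given reported condition means that supposedly come from independent samples with
# a given within-condition SD and n, estimate how often sampling error alone would
# produce means as tightly packed as the reported ones. All numbers are invented.
import numpy as np

rng = np.random.default_rng(3)

reported_means = np.array([4.01, 4.02, 4.03, 4.02])  # suspiciously similar (invented)
reported_sd = 1.0                                     # within-condition SD (invented)
n_per_condition = 20

observed_spread = reported_means.std()

n_sims = 100_000
se_of_mean = reported_sd / np.sqrt(n_per_condition)
sims = rng.normal(reported_means.mean(), se_of_mean, size=(n_sims, len(reported_means)))
simulated_spreads = sims.std(axis=1)

p_this_similar = (simulated_spreads <= observed_spread).mean()
print(f"Chance of condition means this similar under ordinary sampling error: {p_this_similar:.5f}")
```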

And anticipating contemporary calls for more data openness — including the title of Simonsohn’s working paper, “Just Post It” — Stewart writes:

“With present attitudes it’s difficult for an outsider to ask for a scientist’s raw data without appearing to question that person’s integrity. But that attitude absolutely has to change… Once you publish a paper, you’re in essence giving its ideas away. In return for benefits you gain from that – fame, recognition, or whatever – you should be willing to make your lab records and data available.”

Some of the details of how Stewart’s colleagues responded are also alarming. His boss at NIH mused publicly on why he was wasting his talents chasing fraud. Others were even less kind, calling him “the terrorist of the lab.” And when he got into a dispute with his suburban neighbors about not mowing his lawn, Science — yes, that Science — ran a gossip piece on the spat. (Some of the discussions of Simonsohn’s earlier data-detecting efforts have gotten a bit heated, but I haven’t seen anything get that far yet. Let’s hope there aren’t any other social psychologists on the board of his HOA.)

The Stewart interview brought home for me just how much these issues are perennial, and perhaps structural. But the difference from 23 years ago is that we have better tools for change. Journal editors’ gatekeeping powers are weakening in the face of open-access journals and post-publication review.

Will things change for the better? I don’t know. I feel like psychology has an opportunity right now. Maybe we’ll actually step back, have a difficult conversation about what really needs to be done, and make some changes. If not, I bet it won’t be 20 years before the next Stewart/Simonsohn comes along.