Imagine that you have entered a charity drawing to win a free iPad. The charity organizer draws a ticket, and it’s your number. Hooray! But wait, someone else is cheering too. After a little investigation it turns out that due to a printing error, two different tickets had the same winning number. You don’t want to be a jerk and make the charity buy another iPad, and you can’t saw it in half. So you have to decide who gets the iPad.
Suppose that someone proposes to flip a coin to decide who gets the iPad. Sounds pretty fair, right?
But suppose that the other guy with a winning ticket — let’s call him Pete — instead proposes the following procedure. First the organizer will flip a coin. If Pete wins that flip, he gets the iPad. But if you win the flip, then the organizer will toss the coin 2 more times. If Pete wins best out of 3, he gets the iPad. If you win best out of 3, then the organizer will flip the coin yet another 2 times. If Pete wins the best out of those (now) 5 flips, he gets the iPad. If not, keep going… Eventually, if Pete gets tired and gives up before he wins the iPad, you can have it.
Doesn’t sound so fair, does it?
The procedure I just described is not all that different from the research practice of data peeking. Data peeking goes something like this: you run some subjects, then do an analysis. If it comes out significant, you stop. If not, you run some more subjects and try again. What Peeky Pete’s iPad Procedure and data-peeking have in common is that you are starting with a process that includes randomness (coin flips, or the random error in subjects’ behavior) but then using a biased rule to stop the random process when it favors somebody’s outcome. Which means that the “randomness” is no longer random at all.
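To make the stopping rule concrete, here is a minimal sketch of it in R. The batch size and the give-up point are made-up numbers, and both groups are drawn from the same distribution, so there is no true effect to find; any time the loop stops early at p < .05, that is a false positive.

```r
# A single simulated "study" run under the data-peeking rule: test after each
# new batch of subjects, stop as soon as p < .05 (or we run out of patience).
set.seed(42)

batch_size <- 5    # subjects added per group at each peek (made-up number)
max_n      <- 50   # give up at 50 subjects per group (made-up number)

group1 <- rnorm(batch_size)   # null world: both groups come from the same distribution
group2 <- rnorm(batch_size)

repeat {
  p <- t.test(group1, group2)$p.value
  cat(sprintf("n per group = %d, p = %.3f\n", length(group1), p))
  if (p < .05 || length(group1) >= max_n) break   # stop the moment it "works"
  group1 <- c(group1, rnorm(batch_size))          # otherwise, run more subjects
  group2 <- c(group2, rnorm(batch_size))
}
```

Like Pete’s procedure, the rule only ever stops in one direction: the study keeps going until the p-value cooperates or you give up.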
Statisticians have been studying the consequences of data-peeking for a long time (e.g., Armitage et al., 1969). But the practice has received new attention recently in psychology, in large part because of the Simmons et al. false-positive psychology paper that came out last year. Given this attention, it is fair to wonder (1) how common is data-peeking, and (2) how bad is it?
How common is data-peeking?
Anecdotally, a lot of people seem to think data peeking is common. Tal Yarkoni described data peeking as “a time-honored tradition in the social sciences.” Dave Nussbaum wrote that “Most people don’t realize that looking at the data before collecting all of it is much of a problem,” and he says that until recently he was one of those people. Speaking from my own anecdotal experience, ever since Simmons et al. came out I’ve had enough casual conversations with colleagues in social psychology to bring me around to thinking that data peeking is not rare. And in fact, I have talked to more than one fMRI researcher who considers data peeking not only acceptable but beneficial (more on that below).
More formally, when Leslie John and others surveyed academic research psychologists about questionable research practices, a majority (55%) outright admitted that they have “decid[ed] whether to collect more data after looking to see whether the results were significant.” John et al. use a variety of techniques to try to correct for underreporting; they estimate the real prevalence to be much higher. On the flip side, it is at least a little ambiguous whether some respondents might have interpreted “deciding whether to collect more data” to include running a new study, rather than adding new subjects to an existing one. But the bottom line is that data-peeking does not seem to be at all rare.
How bad is it?
You might be wondering, is all the fuss about data peeking just a bunch of rigid stats-nerd orthodoxy, or does it really matter? After all, statisticians sometimes get worked up about things that don’t make much difference in practice. If we’re talking about something that turns a 5% Type I error rate into 6%, is it really a big deal?
The short answer is yes, it’s a big deal. Once you start looking into the math behind data-peeking, it quickly becomes apparent that it has the potential to seriously distort results. Exactly how much depends on a lot of factors: how many cases you run before you take your first peek, how frequently you peek after that, how you decide when to keep running subjects and when to give up, etc. But a good and I think realistic illustration comes from some simulations that Tal Yarkoni posted a couple of years ago. In one instance, Tal simulated what would happen if you run 10 subjects and then start peeking every 5 subjects after that. He found that you would effectively double your Type I error rate by the time you hit 20 subjects. If you peek a little more intensively and run a few more subjects it gets a lot worse. Under what I think are pretty realistic conditions for a lot of psychology and neuroscience experiments, you could easily end up reporting p<.05 when the true false-positive rate is closer to .20.
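If you want to see that inflation for yourself, here is a quick simulation in the spirit of Tal’s. I’m assuming a two-group design with the look schedule applied per group, which may not match his exact setup, but the qualitative result is the same:

```r
# Monte Carlo estimate of the false-positive rate when there is no true effect
# but you look at n = 10 per group and then again every 5 subjects after that.
set.seed(1)

n_sims     <- 5000
first_look <- 10
step       <- 5
max_n      <- 30   # assumed stopping point; raise it and the rate climbs further

false_pos <- replicate(n_sims, {
  x <- rnorm(max_n)   # null data for both groups, generated up front
  y <- rnorm(max_n)
  looks <- seq(first_look, max_n, by = step)
  any(sapply(looks, function(n) t.test(x[1:n], y[1:n])$p.value < .05))
})

mean(false_pos)   # the realized Type I error rate, well above the nominal .05
```

In runs like this the estimate usually comes out at two to three times the nominal .05, and it keeps climbing as you add more looks.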
And that’s a serious difference. Most researchers would never dream of looking at a p=.19 in their SPSS output and then blatantly writing p<.05 in a manuscript. But if you data-peek enough, that could easily end up being de facto what you are doing, even if you didn’t realize it. As Tal put it, “It’s not the kind of thing you just brush off as needless pedantry.”
So what to do?
These issues are only becoming more timely, given current concerns about replicability in psychology. So what to do about it?
The standard advice to individual researchers is: don’t data-peek. Do a power analysis, set an a priori sample size, and then don’t look at your data until you are done for good. This should totally be the norm in the vast majority of psychology studies.
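For the power-analysis step, base R will do the job in one line. The effect size below is just a placeholder; in practice you would base it on prior literature or on the smallest effect you care about.

```r
# A priori sample size for a two-sample, two-sided t-test at alpha = .05 and
# 80% power, assuming a standardized effect of d = 0.5 (placeholder value).
power.t.test(delta = 0.5, sd = 1, sig.level = .05, power = .80,
             type = "two.sample", alternative = "two.sided")
# ~64 subjects per group. Collect that many, analyze once, and stop.
```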
And to add some transparency and accountability, one of Psychological Science’s proposed disclosure statements would require you to state clearly how you determined your sample size. If that happens, other journals might follow suit. If you believe that most researchers want to be honest and just don’t realize how bad data-peeking is, that’s a pretty good way to spread the word. People will learn fast once their papers start getting sent back with a request to run a replication (or rejected outright).
But is abstinence the only way to go? Some researchers make a cost-benefit case for data peeking. The argument goes as follows: With very expensive procedures (like fMRI), it is wasteful to design high-powered studies if that means you end up running more subjects than you need to determine if there is an effect. (As a sidenote, high-powered studies are actually quite important if you are interested in unbiased (or at least less biased) effect size estimation, but that’s a separate conversation; here I’m assuming you only care about significance.) And on the flip side, the argument goes, Type II errors are wasteful too — if you follow a strict no-data-peeking policy, you might run 20 subjects and get p=.11 and then have to set aside the study and start over from scratch.
Of course, it’s also wasteful to report effects that don’t exist. And compounding that, studies that use expensive procedures are also less likely to get directly replicated, which means that false positives are less likely to be caught.
So if you don’t think you can abstain, the next-best thing is to use protection. For those looking to protect their p I have two words: interim analysis. It turns out this is a big issue in the design of clinical trials. Sometimes that is for similar reasons of expense. And sometimes it is because of ethical and safety issues: often in clinical trials you need ongoing monitoring so that you can stop the trial as soon as you can definitively say that the treatment makes things better (so you can give it to the people in the placebo condition) or worse (so you can call off the trial). So statisticians have worked out a whole bunch of ways of designing and analyzing studies so that you can run interim analyses while keeping your false-positive rate in check. (Because of their history, such designs are sometimes called sequential clinical trials, but that shouldn’t scare you off — the statistics don’t care if you’re doing anything clinical.) SAS has a whole procedure for analyzing them, PROC SEQDESIGN. And R users have lots of options. (I don’t know if these procedures have been worked into fMRI analysis packages, but if they haven’t, they really should be.)
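To give a flavor of what those corrections are doing, here is a rough base-R sketch of the Pocock-style idea: pick a stricter per-look threshold so that the chance of crossing it at any look, under the null, stays at .05. The look schedule is made up, and in practice you would lean on an established implementation (the gsDesign package on CRAN is one option) rather than calibrating by simulation yourself.

```r
# Calibrate a per-look alpha for a planned 3-look schedule (assumed looks at
# 20, 40, and 60 subjects per group, two-sample t-tests, no true effect).
set.seed(2)

looks  <- c(20, 40, 60)
n_sims <- 5000

# Smallest p-value across the three looks for each simulated null study
min_p <- replicate(n_sims, {
  x <- rnorm(max(looks))
  y <- rnorm(max(looks))
  min(sapply(looks, function(n) t.test(x[1:n], y[1:n])$p.value))
})

quantile(min_p, .05)   # reject at any look only if p falls below this (~.02, not .05)
mean(min_p < .05)      # the uncorrected "any look" error rate, well above .05
```

The point of the sketch is just to show that the protection comes from paying a small toll at each look; the real sequential-design machinery works those tolls out more elegantly.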
Very little that I’m saying is new. These issues have been known for decades. And in fact, the two recommendations I listed above — determine sample size in advance or properly account for interim testing in your analyses — are the same ones Tal made. So I could have saved some blog space and just said “please go read Armitage et al. or Todd et al. or Yarkoni & Braver.” (UPDATE via a commenter: or Strube, 2006.) But blog space is cheap, and as far as I can tell word hasn’t gotten out very far. And with (now) readily available tools and growing concerns about replicability, it is time to put uncorrected data peeking to an end.
I’m all for spreading the word. It’s also worth including a concrete discussion of how to do interim analysis in the guide that Psychological Science is proposing, which you blogged about last week. I don’t even think it needs to be the norm to report anything other than the ordinary statistical tests in the main text of a manuscript; if you did some peeking, you could add a footnote describing the results of the additional corrective tests.
I think your point is one worth repeating frequently because as much as it’s obvious that you should never bias your data, there are legitimate reasons for wanting to see if your study is working before data collection is complete. Another small trick that is worth trying when the circumstances allow is to peek at the data in a very limited way — for instance, if you have a manipulation check, see if that’s working, but don’t check the DV.
I’m not convinced that data-peeking with interim analysis should become a standard approach. I wrote this post from a fairly conventional NHST perspective because that’s the world we live in. But in the bigger picture I think we should be moving toward effect estimation whenever possible. Which means that we should be running larger studies as a matter of routine (because they produce less biased estimates of effects), and data-peeking with interim analysis should be a niche procedure for when cost-benefit analyses justify it.
Your comment about manipulation checks raises a couple of issues. One is that manipulation checks and DVs are correlated, so data-peeking on your manipulation check still introduces bias. An even bigger issue is that we shouldn’t be running experiments with unvalidated manipulations. In the scale construction world it’s widely accepted as best practice to run validation and cross-validation studies before you start using a scale for substantive theory-testing. From a psychometric standpoint the same logic applies to experimental manipulations. But what seems to be common practice is either that people rely on unpublished small-sample pilot studies, or they roll the manipulation check and the hypothesis test into one study with no cross-validation.
What do you think of the approach recommended in Strube (2006) (http://www.springerlink.com/content/yq38354457m28m81/), which is to do some kind of post-hoc correction? Then in the journal you would report “we looked at our DV after N subjects and applied a correction…”
I think it’s really nice to say “well, we should be running larger studies,” but the truth is that subjects are still very hard to come by, especially if you want to do lab studies with complex protocols. There is a cost-benefit tradeoff involved in running more subjects versus disengaging to work on something else, especially considering that an academic psychologist’s career is basically a race for publications, and resources for most people are limited.
Hey, cool! I didn’t know about that Strube paper, but it gives a really good overview of the consequences of different kinds of data peeking. (I’m a Mac guy so I didn’t download the software. But it probably wouldn’t be too hard for someone to take the procedure Strube describes in the paper and write up an R program to do something similar.)
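For what it’s worth, here is a rough sketch of what “something similar” might look like. This is a generic simulation-based correction rather than Strube’s exact procedure, and the look schedule and observed p-value are hypothetical: given the looks you actually took, you ask how often a null study would have produced a p-value at least as small as yours at some look.

```r
# Simulation-based post-hoc correction for a study that peeked twice
# (hypothetical looks at 20 and 30 subjects per group, smallest observed p = .04).
set.seed(3)

looks      <- c(20, 30)
p_observed <- 0.04
n_sims     <- 5000

min_p_null <- replicate(n_sims, {
  x <- rnorm(max(looks))
  y <- rnorm(max(looks))
  min(sapply(looks, function(n) t.test(x[1:n], y[1:n])$p.value))
})

mean(min_p_null <= p_observed)   # the corrected p-value, larger than the raw .04
```

That corrected number, alongside the raw p-values at each look, is the kind of thing that could go in the footnote described in the comment above.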
We absolutely need to be realistic that researchers operate under incentives and constraints. But I’ve seen plenty of studies with inexpensive, easy procedures with n=14 per cell for no good reason. And in instances where cost-benefit considerations might justify small Ns, I’ve seen plenty of overclaiming about effect sizes and robustness. So when I say we need to run larger studies, part of that means that people in charge at journals, granting agencies, etc. need to better consider power when evaluating completed and proposed research. That includes being willing to send papers back when they show a cool result with a too-small sample. It also includes rewarding authors (rather than punishing them) for being forthright about the limitations of what one can conclude from expensive, small-N studies.