Some thoughts on replication and falsifiability: Is this a chance to do better?

Most psychologists would probably endorse falsification as an important part of science. But in practice we rarely do it right. As others have observed before me, we do it backwards. Instead of designing experiments to falsify the hypothesis we are testing, we look for statistical evidence against a “nil null” — the point prediction that the true effect is zero. Sometimes the nil null is interesting, sometimes it isn’t, but it’s almost never a prediction from the theory that we are actually hoping to draw conclusions about.

The more rigorous approach is to derive a quantitative prediction from a theory. Then you design an experiment where the prediction could fail if the theory is wrong. Statistically speaking, the null hypothesis should be the prediction from your theory (“when dropped, this object will accelerate toward the earth at 9.8 m/s^2”). Then if a “significant” result tells you that the data are inconsistent with the theory (“average measured acceleration was 8.6 m/s^2, which differs from 9.8 at p < .05”), you have to set aside either the theory itself or one of the supporting assumptions you made when you designed the experiment. You get some leeway to look to the supporting assumptions (“oops, 9.8 assumes no wind resistance”), but not endless leeway: if the predictions keep failing, eventually you have to face facts and walk away from your theory. On the flip side, a theory is corroborated when it survives many risky opportunities to fail.

The problem in psychology (and many other sciences, including quite a bit of biology and medicine) is that our theories rarely make specific enough quantitative predictions to do hypothesis testing the “right” way. Few of our theories lead to anything remotely close to “g = 9.8 m/s^2” in specificity. People sometimes suggest this is a problem with psychologists’ acumen as theorists. I am more inclined to think it is a function of being a young science and having chosen very difficult problems to solve. So in the grand scheme, I don’t think we should self-flagellate too much about being poor theorists or succumb to physics envy. Most of the time I am inclined to agree with people like Paul Rozin (who was agreeing with Solomon Asch) and William McGuire that instead we need to adapt our approach to our scientific problems and current state of knowledge, rather than trying to ape a caricature of “hard” science. That requires changes in how we do science: we need more exploration and discovery to accumulate interesting knowledge about our phenomena, and we need to be more modest and conditional in our theories. It would be a mistake to say we need to simply double down on the caricature.

So with all this being said, there is something really interesting and I think under-appreciated about the recent movement toward replication, and it is this: This may be a great opportunity to do falsification better.

The repeatability theory

Every results section says some version of, “We did this experiment and we observed these results.”[1] It is a specific statement about something that happened in the past. But hand-in-hand with that statement is, implicitly, another claim: “If someone does the same experiment again, they will get the same results.” The second claim is a mini-theory: it is a generalization of the first claim. Call it the repeatability theory. Every experimental report comes with its own repeatability theory. It is a necessary assumption of inferential statistics. And if we did not make it, we would be doing history rather than science.

And here’s the thing: the repeatability theory is very falsifiable. The rigorous, strong kind of falsifiable. We just need to clarify what it means to (A) do the same experiment again and (B) observe the same or different results.

Part B is a little easier. “The same results” does not mean exactly the same results to infinite precision. It means “the same results plus or minus error.” The hypothesis is that Experiment 1 (the original) and Experiment 2 (the replication) are observations with error of the same underlying effect, so any observed differences between experiments are just noise. If you are using NHST[2] that leads to a straightforward “strong” null hypothesis: effectsize_1 = effectsize_2. If you have access to all the raw data, you can combine both experiments into a single dataset, create an indicator variable for which study the effect came from, and test the interaction of that indicator with the effect. The null hypothesis is no interaction, which sounds like the old fashioned nil-null but in fact “interaction = 0” is the same as saying the effects are equal, which is the very specific quantitative hypothesis derived from the repeatability theory. If you don’t have the raw data, don’t despair. You can calculate an effect from each experiment and then compare them, like with a test of independent correlations. You can and should also estimate the difference between effects (effectsize_1 – effectsize_2) and an associated confidence interval. That difference is itself an effect size: it quantifies whatever difference there is between the studies, and can tell you if the difference is large or trivial.
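For the no-raw-data case, here is a minimal Python sketch of the comparison. It assumes the effect in each study is a Pearson correlation and uses the standard Fisher z approximation; the function name and the example numbers are made up for illustration and do not come from any real pair of studies.

```python
import math
from statistics import NormalDist


def compare_independent_correlations(r1, n1, r2, n2, conf_level=0.95):
    """Test the repeatability null r1 == r2 for two independent samples.

    Uses Fisher's r-to-z transformation, under which each transformed
    correlation has an approximate standard error of 1 / sqrt(n - 3).
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se_diff = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    diff = z1 - z2
    z_stat = diff / se_diff
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z_stat)))
    crit = NormalDist().inv_cdf(0.5 + conf_level / 2.0)
    ci = (diff - crit * se_diff, diff + crit * se_diff)
    return {"diff_z": diff, "z_stat": z_stat, "p_value": p_value, "ci_z": ci}


# Hypothetical numbers: original r = .50 (n = 50), replication r = .10 (n = 200).
result = compare_independent_correlations(0.50, 50, 0.10, 200)
print(result)
```

Note that the difference and its confidence interval are on the Fisher z scale, not the raw correlation scale. A small p-value here counts against the repeatability theory; a narrow interval around zero is what corroboration looks like.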

Part A, “do the same experiment again,” is more complicated. Literalists like to point out that you will never be in the same room, with the same weather outside, with the same RA wearing the same shirt, etc. etc. They are technically right about all of that.[3]

But the realistic answer is that “the same experiment” just has to repeat the things that matter. “What matters” has been the subject of some discussion recently, for example in a published commentary by Danny Kahneman and a blog post by Andrew Wilson. In my thinking you can divide “what matters” into 3 categories: the original researchers’ specification of the experiment, technical skills in the methods used, and common sense. The onus is on the original experimenter to be able to tell a competent colleague what is necessary to repeat the experiment. In the old days of paper journals and page counts, it was impossible for most published papers to do this completely and you needed a lot of backchannel communication. With online supplements the gap is narrowing, but I still think it can’t hurt for a replicator to reach out to an original author. (Though in contrast to Kahneman, I would describe this as a methodological best practice, neither a matter of etiquette nor an absolute requirement.) If researchers say they do not know what conditions are necessary to produce an effect, that is no defense. It should undermine our faith in the original study. Don’t take my word for it, here’s Sir Karl (whose logic is better than his language – this is [hopefully obviously] limited neither to men nor physicists):

Every experimental physicist knows those surprising and inexplicable apparent ‘effects’ which in his laboratory can perhaps even be reproduced for some time, but which finally disappear without trace. Of course, no physicist would say in such a case that he had made a scientific discovery (though he might try to rearrange his experiments so as to make the effect reproducible). Indeed the scientifically significant physical effect may be defined as that which can be regularly reproduced by anyone who carries out the appropriate experiment in the way prescribed. No serious physicist would offer for publication, as a scientific discovery, any such ‘occult effect,’ as I propose to call it – one for whose reproduction he could give no instructions. (Karl Popper, The Logic of Scientific Discovery, pp. 23-24)

Interpreting results

What happens when the data are inconsistent with the repeatability theory – original != replication? As with all empirical results, we have to consider multiple interpretations. This is true in all of science and has been recognized for a long time; replications are not special in this regard. An observed discrepancy between the original result and a replication[4] is an empirical finding that needs to be interpreted like any other empirical finding. However, a few issues come up commonly in interpreting replications:

First vs. latest. There is nothing special about an experiment being either the first or the latest, ceteris paribus. However ceteris is rarely paribus. If the replication has more power or if the scientific community gets to see its results through a less biased process than the original (e.g., due to pre-registration or a results-independent publication process), those things should give it more weight.

Technical skills. A technical analysis of the methods used and labs’ track records with them is appropriate. I am not much swayed by broad appeals to experimental “artistry.” Instead, I find these interpretations more persuasive when someone can put forward a plausible candidate for something important in the original that is not easy to standardize or carry off without specific skills. For example, a computer-administered experiment is possible to standardize and audit (and in some cases the code and digital stimuli can be reproduced exactly). But an experiment that involves confederates or cover stories might be harder to pull off for a lab that does not do that routinely. When that is the case, manipulation checks, lab visits/exchanges (in person or through video), and other validation procedures become important.

Moderators. Replications can never reproduce every single aspect of the original study. They do their best to reproduce everything that the original specification, technical knowledge, and common sense say should matter. But they can and will still depart from original studies in any number of ways: the subject pool being drawn from, the local social and cultural context, procedural changes made for practical reasons, etc. When the replication departs substantially from the original, it is fair to consider possible moderators. But moderator interpretations are nearly always post hoc, and should be weighed accordingly until we have more data.

I think it’s also important to point out that the possibility of unanticipated moderators is not a problem with replications; rather, if you are interested in discovery it is a very good reason to run them. Consider a hypothetical example from a recent blog post by Tim Wilson: a study originally run in the laboratory that produces a smaller effect in an online replication. Wilson imagines this is an outcome that a replicator with improbable amounts of both malevolence and prescience might arrange on purpose. But consider a far more likely scenario: if the original specification, technical knowledge, and common sense all say that offline-online shouldn’t matter but it turns out that it does, that could actually be a very interesting discovery! People are living more of their lives online, and it is important to know how social cognition and behavior work in virtual spaces. And a discovery like that might also save other scientists a lot of wasted effort and resources, if for example they thought the experiment would work online and planned to run replicate-and-extend studies or adapt parts of the original procedure for new studies. In the end, Wilson’s example of replication gone wrong looks more like a useful discovery.

Discovery and replication need each other

Discovery and replication are often contrasted with each other. Discovery is new and exciting; replication is dull “duplication.” But that is silly. Replication separates real discoveries from noise-surfing, and as just noted it can itself lead to discoveries. We can and should do both. And not just in some sort of division of labor arrangement, but in an integrated way as part of our science. Exciting new discoveries need to be replicated before we take them as definitive. Replication within and between labs should be routine and normal.

An integrated discovery-replication approach is also an excellent way to build theories. Both Rozin and McGuire criticize psychology’s tendency to equate “theory” with broad, decontextualized statements – pronouncements that almost invariably get chipped away in subsequent studies as we discover moderators and boundary conditions. This kind of “overclaim first, then back away slowly” approach supports the hype cycle and means that a tendency to make incorrect statements is baked in to our research process. Instead, Rozin wants us to accumulate interesting descriptive facts about the phenomena we are studying; McGuire wants us to study how effects vary over populations and contexts. A discovery-replication approach allows us to do both of these things. We can use discovery-oriented exploratory research to derive truly falsifiable predictions to then be tested. That way we will amass a body of narrow but well-corroborated theoretical statements (the repeatability theories) to assemble into bigger theories from the foundations up, rather than starting with bold pronouncements. We will also build up knowledge about quantitative estimates of effects, which we can use to start to make interval and even point predictions. That kind of cumulative science is likely to generate fewer sexy headlines in the short run, but it will be a whole lot more durable.

—–

1. I am using “experiment” in the very broad sense here of a structured scientific observation, not the more limited sense of a study that involves randomized manipulation by an experimenter.[5]

2. I’m sure the Bayesians have an answer for the statistical problem too. It is probably a good one. But c’mon, this is a chance to finally do NHST right!

3. Literalists also like to say it’s a problem that you will never have the exact same people as subjects again. They are technically wrong about that being a problem. “Drawing a sample” is part of what constitutes the experiment. But pressing this point will get you into an argument with a literalist over a technicality, which is never fun, so I suggest letting it drop.

4. “Discrepancy” = “failed replication” in the parlance of our time, but I don’t like that phrase. Who/what failed? Totally unclear, and the answer may be nobody/nothing.

5. I am totally ripping this footnote thing off of Simine Vazire but telling myself I’m ripping off David Foster Wallace.

The flawed logic of chasing large effects with small samples

“I don’t care about any effect that I need more than 20 subjects per cell to detect.”

I have heard statements to this effect a number of times over the years. Sometimes from the mouths of some pretty well-established researchers, and sometimes from people quoting the well-established researchers they trained under. The idea is that if an effect is big enough — perhaps because of its real-world importance, or because of the experimenter’s skill in isolating and amplifying the effect in the lab — then you don’t need a big sample to detect it.

When I have asked people why they think that, the reasoning behind it goes something like this. If the true effect is large, then even a small sample will have a reasonable chance of detecting it. (“Detecting” = rejecting the null in this context.) If the true effect is small, then a small sample is unlikely to reject the null. So if you only use small samples, you will limit yourself to detecting large effects. And if that’s all you care about detecting, then you’re fine with small samples.

On first consideration, that might sound reasonable, and even admirably aware of issues of statistical power. Unfortunately it is completely wrong. Some of the problems with it are statistical and logical. Others are more practical:

1. It involves a classic error in probabilistic thinking. In probability notation, you are incorrectly assuming that P(A|B) = P(B|A). In psychology jargon, you are guilty of base rate insensitivity.

Here’s why. A power analysis tells you the following: for a given sample size, IF the effect is large (or small) THEN you will have a certain probability of rejecting the null. But that probability is not the same as the probability that IF you have rejected the null THEN the effect is large (or small). The latter probability depends both on power and on how common large and small effects are to begin with: what a statistician would call the prior probability, and a psychologist would call the base rate.

To put this in context, suppose an experimenter is working in an area where most effects are small, some are medium, and a few are large (which pretty well describes the field of social psychology as a whole). The experimenter does not know in advance which it is, of course. When it turns out that the experimenter has stumbled onto one of the occasional large effects, the test will probably be significant. But more often the true effect will be small. Many of those will be “missed,” but there will be so many of them (relative to the number of experiments run) that the ones that do reach significance will end up being the majority of the significant results.

Consider a simplified numerical example with just 2 possibilities. Suppose that 10% of experiments are chasing an effect that is so big there’s a 90% chance of detecting it (.90 power), and 90% of experiments are chasing smaller effects with only a 40% chance of detection (.40 power). Out of 100 experiments, the experimenter will get 9 significant results from the large effects, and 36 significant results from the small effects. So most of the significant results (80% of them) will come from having gotten a little bit lucky with small effects, rather than having nailed a big effect.

(Of course, that simplified example assumes, probably too generously, that the null is never true or close-to-true. Moreover, with small samples the most common outcome — absent any p-hacking — will be that the results are not significant at all, even when there really is an effect.)
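The arithmetic in that simplified example is easy to check, and to rerun with your own guesses about base rates and power. A minimal Python sketch (the function name is mine; the numbers are the ones from the example):

```python
def significant_results_by_effect(base_rates, powers, n_experiments=100):
    """Expected count of significant results from each kind of true effect."""
    return {
        kind: n_experiments * base_rates[kind] * powers[kind]
        for kind in base_rates
    }


# Numbers from the simplified two-possibility example in the text.
base_rates = {"large": 0.10, "small": 0.90}
powers = {"large": 0.90, "small": 0.40}

counts = significant_results_by_effect(base_rates, powers)
total = sum(counts.values())
share_small = counts["small"] / total

print({kind: round(count, 1) for kind, count in counts.items()})
print(f"share of significant results from small effects: {share_small:.0%}")
```

This is just Bayes’ rule in counting form: the share depends on the base rates as much as on the power.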

If a researcher really is only interested in identifying effects of a certain size, it might seem reasonable to calculate effect sizes. But with small samples, the researcher will greatly overestimate effect sizes in the subset of results that are significant, and the amount of bias will be the greatest when the true effects are small. That is because in order to show up as significant, those small (true) effects will need to have been helped along by sampling error, giving them a positive bias as a group. So the data won’t be much help in distinguishing truly large effects from the times when chance helped along a small one. They’ll look much the same.
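A small simulation makes the overestimation concrete. This is only a sketch under assumptions I am choosing for illustration: a two-cell design with n = 20 per cell, normally distributed scores, a true standardized effect of 0.2, and a two-tailed .05 critical t of 2.024 for 38 degrees of freedom.

```python
import math
import random
import statistics

N_PER_CELL = 20
TRUE_D = 0.2      # assumed small true effect, in standard deviation units
T_CRIT = 2.024    # two-tailed .05 critical t for df = 38
N_SIMS = 3000

rng = random.Random(1)
significant_ds = []

for _ in range(N_SIMS):
    treatment = [rng.gauss(TRUE_D, 1.0) for _ in range(N_PER_CELL)]
    control = [rng.gauss(0.0, 1.0) for _ in range(N_PER_CELL)]
    pooled_sd = math.sqrt(
        (statistics.variance(treatment) + statistics.variance(control)) / 2.0
    )
    d_obs = (statistics.mean(treatment) - statistics.mean(control)) / pooled_sd
    # For two equal cells, t equals the observed d times sqrt(n / 2).
    t_stat = d_obs * math.sqrt(N_PER_CELL / 2.0)
    # Keep only results that are significant in the predicted direction.
    if t_stat > T_CRIT:
        significant_ds.append(d_obs)

share_significant = len(significant_ds) / N_SIMS
mean_significant_d = statistics.mean(significant_ds)
print(f"true d = {TRUE_D}, share significant = {share_significant:.2f}")
print(f"mean observed d among significant results = {mean_significant_d:.2f}")
```

The design guarantees the bias: with 20 per cell, an observed d has to exceed roughly 0.64 to reach significance at all, so every significant estimate is more than three times the true effect assumed here.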

2. Behind the “I only need small samples” argument is the idea that the researcher attaches some special interpretive value (theoretical meaning or practical significance) to effects that are at least some certain size. But if that is the case, then the researcher needs to adjust how he or she does hypothesis testing. In conventional NHST, the null hypothesis you are trying to rule out states that the effect is zero. But if you only care about medium-and-larger effects, then you need to go a step further and rule out the hypothesis that the effect is small. Which is entirely possible to do, and not that difficult mathematically. But in order to differentiate a medium effect from a small one, you need statistical power. n=20 per cell won’t cut it.
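To see why n=20 per cell won’t cut it, here is a rough power calculation for a test whose null is “the effect is small” rather than “the effect is zero.” It is a sketch, not a full power analysis: it uses a normal approximation, takes the standard error of a standardized mean difference to be sqrt(2/n), and borrows Cohen’s conventional 0.2 and 0.5 as stand-ins for “small” and “medium.” Those benchmarks are my assumption, not part of the argument above.

```python
import math
from statistics import NormalDist


def power_to_rule_out(d_null, d_true, n_per_cell, alpha=0.05):
    """Approximate power of a one-sided test of H0: d <= d_null.

    Normal approximation: the standard error of a standardized mean
    difference between two equal cells is taken as sqrt(2 / n_per_cell).
    """
    se = math.sqrt(2.0 / n_per_cell)
    z_crit = NormalDist().inv_cdf(1.0 - alpha)
    return 1.0 - NormalDist().cdf(z_crit - (d_true - d_null) / se)


# Assumed benchmarks: "small" = 0.2 and "medium" = 0.5.
for n in (20, 100, 200):
    print(n, round(power_to_rule_out(0.2, 0.5, n), 2))
```

Under these assumptions, a truly medium effect gets distinguished from a small one only about a quarter of the time at 20 per cell, and it takes something like 200 per cell to get power above .90.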

(This situation is in some ways the mirror opposite of when researchers say they only care about the direction of an effect, not its size. But even if nothing substantive is riding on the effect size, most effects turn out to be small in size, and the experiment is only worth doing if it is reasonably capable of detecting something.)

3. All of the above assumes optimal scientific practice — no researcher degrees of freedom, no publication bias, etc. In the real world, of course, things are rarely optimal.

When it comes to running studies with small samples, one of the biggest risks is engaging in data churning and capitalizing on chance — even inadvertently. Consider two researchers who are planning to run 2-condition experiments in which a significant result would be theoretically interesting and publishable. Both decide to dedicate the time and resources to running up to 200 subjects on the project before giving up and moving on. Researcher A runs one well-powered study with n=100 per condition. Researcher B decides to run the experiment with n=20 per condition. If it is significant, B will stop and write up the result; if it is not significant, B tweaks the procedure and tries again, up to 5 times.

Without realizing it, Researcher B is working with an effective Type I error rate that is about 23%.
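That figure comes from treating B’s five attempts as independent tests at alpha = .05 when the null is true every time. A quick Python check (the function name is mine):

```python
def chance_of_any_false_positive(alpha, attempts):
    """Probability of at least one significant result in `attempts`
    independent tests when the null is true in every one of them."""
    return 1.0 - (1.0 - alpha) ** attempts


researcher_a = chance_of_any_false_positive(0.05, 1)
researcher_b = chance_of_any_false_positive(0.05, 5)
print(f"Researcher A: {researcher_a:.3f}")
print(f"Researcher B: {researcher_b:.3f}")
```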

Of course, Researcher B should also recognize that this is a cross-validation situation. By tweaking and repeating (or merely being willing to tweak and repeat if the result is not significant), B is engaged in exploratory research. To guard against capitalizing on chance, the variation on the experiment that finally “works” needs to be run again. That is actually true regardless of sample size or power. But the practical consequences of violating it are bigger and bigger as samples get smaller and smaller.

The upshot of all of this is pretty straightforward: you cannot talk your way out of running studies with adequate power. It does not matter if you only care about large effects, if you care about small effects too, or if you do not care about effect sizes at all. It doesn’t even matter if you focus on estimation rather than NHST (which I wholeheartedly support, by the way); you still need adequate samples. The only alternatives are (a) to live with a lot of ambiguous (nonsignificant) results until enough accrue to do a meta-analysis, or (b) to p-hack your way out of them.

Changing software to nudge researchers toward better data analysis practice

The tools we have available to us affect the way we interact with and even think about the world. “If all you have is a hammer” etc. Along these lines, I’ve been wondering what would happen if the makers of data analysis software like SPSS, SAS, etc. changed some of the defaults and options. Sort of in the spirit of Nudge — don’t necessarily change the list of what is ultimately possible to do, but make changes to make some things easier and other things harder (like via defaults and options).

Would people think about their data differently? Here’s my list of how I might change regression procedures, and what I think these changes might do:

1. Let users write common transformations of variables directly into the syntax. Things like centering, z-scoring, log-transforming, multiplying variables into interactions, etc. This is already part of some packages (it’s easy to do in R), but not others. In particular, running interactions in SPSS is a huge royal pain. For example, to do a simple 2-way interaction with centered variables, you have to write all this crap *and* cycle back and forth between the code and the output along the way:

desc x1 x2.
* Run just the above, then look at the output and see what the means are, then edit the code below.
compute x1_c = x1 - [whatever the mean was].
compute x2_c = x2 - [whatever the mean was].
compute x1x2 = x1_c*x2_c.
regression /dependent y /enter x1_c x2_c x1x2.

Why shouldn’t we be able to do it all in one line like this?

regression /dependent y /enter center(x1) center(x2) center(x1)*center(x2).

The nudge: If it were easy to write everything into a single command, maybe more people would look at interactions more often. And maybe they’d stop doing median splits and then jamming everything into an ANOVA!

2. By default, the output shows you parameter estimates and confidence intervals.

3. Either by default or with an easy-to-implement option, you can get a variety of standardized effect size estimates with their confidence intervals. And let’s not make variance-explained metrics (like R^2 or eta^2) the defaults.

The nudge: #2 and #3 are both designed to focus people on point and interval estimation, rather than NHST.

This next one is a little more radical:

4. By default the output does not show you inferential t-tests and p-values — you have to ask for them through an option. And when you ask for them, you have to state what the null hypotheses are! So if you want to test the null that some parameter equals zero (as 99.9% of research in social science does), hey, go for it — but it has to be an active request, not a passive default. And if you want to test a null hypothesis that some parameter is some nonzero value, it would be easy to do that too.

The nudge. In the way a lot of statistics is taught in psychology, NHST is the main event and effect estimation is an afterthought. This would turn it around. And by making users specify a null hypothesis, it might spur us to pause and think about how and why we are doing so, rather than just mining for asterisks to put in tables. Heck, I bet some nontrivial number of psychology researchers don’t even know that the null hypothesis doesn’t have to be the nil hypothesis. (I still remember the “aha” feeling the first time I learned that you could do that — well along into graduate school, in an elective statistics class.) If we want researchers to move toward point or range predictions with strong hypothesis testing, we should make it easier to do.

All of these things are possible to do in most or all software packages. But as my SPSS example under #1 shows, they’re not necessarily easy to implement in a user-friendly way. Even R doesn’t do all of these things in the standard lm function. As a result, they probably don’t get done as much as they could or should.
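To make #2 and #4 concrete, here is a toy Python sketch of what such an interface might look like. It is not how any existing package works: the function, its arguments, and the example numbers are hypothetical, and it uses a large-sample normal approximation where a real regression routine would use a t distribution.

```python
from statistics import NormalDist


def report_estimate(estimate, std_error, null_value=None, conf_level=0.95):
    """Report an estimate with its confidence interval.

    A test is run only when the caller states a null value explicitly.
    Uses a large-sample normal approximation rather than a t distribution.
    """
    crit = NormalDist().inv_cdf(0.5 + conf_level / 2.0)
    report = {
        "estimate": estimate,
        "ci": (estimate - crit * std_error, estimate + crit * std_error),
    }
    if null_value is not None:
        z_stat = (estimate - null_value) / std_error
        report["null_value"] = null_value
        report["z_stat"] = z_stat
        report["p_value"] = 2.0 * (1.0 - NormalDist().cdf(abs(z_stat)))
    return report


# Hypothetical slope of 0.30 with a standard error of 0.10.
print(report_estimate(0.30, 0.10))                   # estimate and CI only
print(report_estimate(0.30, 0.10, null_value=0.0))   # the nil null, on request
print(report_estimate(0.30, 0.10, null_value=0.25))  # a non-nil null
```

The design choice is the nudge itself: the default call gives you estimation, and a p-value only shows up once you have typed in the hypothesis it refers to.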

Any other nudges you’d make?

Data peeking is always wrong (except when you do it right)

Imagine that you have entered a charity drawing to win a free iPad. The charity organizer draws a ticket, and it’s your number. Hooray! But wait, someone else is cheering too. After a little investigation it turns out that due to a printing error, two different tickets had the same winning number. You don’t want to be a jerk and make the charity buy another iPad, and you can’t saw it in half. So you have to decide who gets the iPad.

Suppose that someone proposes to flip a coin to decide who gets the iPad. Sounds pretty fair, right?

But suppose that the other guy with a winning ticket — let’s call him Pete — instead proposes the following procedure. First the organizer will flip a coin. If Pete wins that flip, he gets the iPad. But if you win the flip, then the organizer will toss the coin 2 more times. If Pete wins best out of 3, he gets the iPad. If you win best out of 3, then the organizer will flip yet another 2 times. If Pete wins the best out of those (now) 5 flips, he gets the iPad. If not, keep going… Eventually, if Pete gets tired and gives up before he wins the iPad, you can have it.

Doesn’t sound so fair, does it?

The procedure I just described is not all that different from the research practice of data peeking. Data peeking goes something like this: you run some subjects, then do an analysis. If it comes out significant, you stop. If not, you run some more subjects and try again. What Peeky Pete’s iPad Procedure and data-peeking have in common is that you are starting with a process that includes randomness (coin flips, or the random error in subjects’ behavior) but then using a biased rule to stop the random process when it favors somebody’s outcome. Which means that the “randomness” is no longer random at all.

Statisticians have been studying the consequences of data-peeking for a long time (e.g., Armitage et al., 1969). But the practice has received new attention recently in psychology, in large part because of the Simmons et al. false-positive psychology paper that came out last year. Given this attention, it is fair to wonder (1) how common is data-peeking, and (2) how bad is it?

How common is data-peeking?

Anecdotally, a lot of people seem to think data peeking is common. Tal Yarkoni described data peeking as “a time-honored tradition in the social sciences.” Dave Nussbaum wrote that “Most people don’t realize that looking at the data before collecting all of it is much of a problem,” and he says that until recently he was one of those people. Speaking from my own anecdotal experience, ever since Simmons et al. came out I’ve had enough casual conversations with colleagues in social psychology to bring me around to thinking that data peeking is not rare. And in fact, I have talked to more than one fMRI researcher who considers data peeking not only acceptable but beneficial (more on that below).

More formally, when Leslie John and others surveyed academic research psychologists about questionable research practices, a majority (55%) outright admitted that they have “decid[ed] whether to collect more data after looking to see whether the results were significant.” John et al. use a variety of techniques to try to correct for underreporting; they estimate the real prevalence to be much higher. On the flip side, it is at least a little ambiguous whether some respondents might have interpreted “deciding whether to collect more data” to include running a new study, rather than adding new subjects to an existing one. But the bottom line is that data-peeking does not seem to be at all rare.

How bad is it?

You might be wondering, is all the fuss about data peeking just a bunch of rigid stats-nerd orthodoxy, or does it really matter? After all, statisticians sometimes get worked up about things that don’t make much difference in practice. If we’re talking about something that turns a 5% Type I error rate into 6%, is it really a big deal?

The short answer is yes, it’s a big deal. Once you start looking into the math behind data-peeking, it quickly becomes apparent that it has the potential to seriously distort results. Exactly how much depends on a lot of factors: how many cases you run before you take your first peek, how frequently you peek after that, how you decide when to keep running subjects and when to give up, etc. But a good and I think realistic illustration comes from some simulations that Tal Yarkoni posted a couple years ago. In one instance, Tal simulated what would happen if you run 10 subjects and then start peeking every 5 subjects after that. He found that you would effectively double your Type I error rate by the time you hit 20 subjects. If you peek a little more intensively and run a few more subjects it gets a lot worse. Under what I think are pretty realistic conditions for a lot of psychology and neuroscience experiments, you could easily end up reporting p<.05 when the true false-positive rate is closer to .20.
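If you want to get a feel for this yourself, a simulation takes only a few lines. The sketch below is my own simplified setup, not a reconstruction of Tal’s code: it assumes a one-sample design with a true effect of zero, normally distributed scores, and a z-test with a known SD of 1 (so the critical value is exactly 1.96), and it stops the first time a peek comes out significant.

```python
import math
import random


def false_positive_rate(first_look, step, max_n, n_sims=4000, seed=1):
    """Share of null experiments declared significant at any look.

    Each simulated experiment draws scores from a standard normal (true
    effect = 0) and runs a two-tailed z-test of the mean, assuming a known
    SD of 1, at n = first_look, first_look + step, ..., up to max_n.
    """
    rng = random.Random(seed)
    looks = set(range(first_look, max_n + 1, step))
    hits = 0
    for _ in range(n_sims):
        total = 0.0
        for n in range(1, max_n + 1):
            total += rng.gauss(0.0, 1.0)
            # z for the sample mean with known SD 1 is sum / sqrt(n).
            if n in looks and abs(total / math.sqrt(n)) > 1.96:
                hits += 1
                break  # stop collecting as soon as the result is "significant"
    return hits / n_sims


print("one look at n = 50:       ", false_positive_rate(50, 50, 50))
print("peek every 5 from n = 10: ", false_positive_rate(10, 5, 50))
```

The single-look version should land near the nominal .05; the peeking version lands well above it, and changing `first_look`, `step`, and `max_n` shows how quickly things get worse.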

And that’s a serious difference. Most researchers would never dream of looking at a p=.19 in their SPSS output and then blatantly writing p<.05 in a manuscript. But if you data-peek enough, that could easily end up being de facto what you are doing, even if you didn’t realize it. As Tal put it, “It’s not the kind of thing you just brush off as needless pedantry.”

So what to do?

These issues are only becoming more timely, given current concerns about replicability in psychology. So what to do about it?

The standard advice to individual researchers is: don’t data-peek. Do a power analysis, set an a priori sample size, and then don’t look at your data until you are done for good. This should totally be the norm in the vast majority of psychology studies.

And to add some transparency and accountability, one of Psychological Science’s proposed disclosure statements would require you to state clearly how you determined your sample size. If that happens, other journals might follow after that. If you believe that most researchers want to be honest and just don’t realize how bad data-peeking is, that’s a pretty good way to spread the word. People will learn fast once their papers start getting sent back with a request to run a replication (or rejected outright).

But is abstinence the only way to go? Some researchers make a cost-benefit case for data peeking. The argument goes as follows: With very expensive procedures (like fMRI), it is wasteful to design high-powered studies if that means you end up running more subjects than you need to determine if there is an effect. (As a sidenote, high-powered studies are actually quite important if you are interested in unbiased (or at least less biased) effect size estimation, but that’s a separate conversation; here I’m assuming you only care about significance.) And on the flip side, the argument goes, Type II errors are wasteful too — if you follow a strict no-data-peeking policy, you might run 20 subjects and get p=.11 and then have to set aside the study and start over from scratch.

Of course, it’s also wasteful to report effects that don’t exist. And compounding that, studies that use expensive procedures are also less likely to get directly replicated, which means that their false positives are less likely to be found out.

So if you don’t think you can abstain, the next-best thing is to use protection. For those looking to protect their p I have two words: interim analysis. It turns out this is a big issue in the design of clinical trials. Sometimes that is for very similar expense reasons. And sometimes it is because of ethical and safety issues: often in clinical trials you need ongoing monitoring so that you can stop the trial just as soon as you can definitively say that the treatment makes things better (so you can give it to the people in the placebo condition) or worse (so you can call off the trial). So statisticians have worked out a whole bunch of ways of designing and analyzing studies so that you can run interim analyses while keeping your false-positive rate in check. (Because of their history, such designs are sometimes called sequential clinical trials, but that shouldn’t chase you off — the statistics don’t care if you’re doing anything clinical.) SAS has a whole procedure for analyzing them, PROC SEQDESIGN. And R users have lots of options. (I don’t know if these procedures have been worked into fMRI analysis packages, but if they haven’t, they really should be.)
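Here is a rough simulation of why the correction matters. It is only a sketch under simplifying assumptions that I am making up for illustration: a one-sample z-test with known SD, 20 new subjects per look, up to 5 looks, and no true effect. It compares testing at the usual 1.96 on every look against a Pocock-style constant boundary (2.413 is the standard tabled Pocock value for 5 equally spaced looks at an overall two-sided alpha of .05). Real sequential designs, like the ones in the SAS and R tools mentioned above, offer more flexible boundaries than this.

```python
import random
from math import sqrt


def false_positive_rate(critical_z, looks=5, batch=20, sims=4000, seed=1):
    """Share of null simulations that cross |z| > critical_z at any look."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(sims):
        total, n = 0.0, 0
        for _ in range(looks):
            # Collect another batch of null data (true mean 0, known SD 1).
            total += sum(rng.gauss(0.0, 1.0) for _ in range(batch))
            n += batch
            z = total / sqrt(n)  # z-statistic for the mean of all data so far
            if abs(z) > critical_z:
                hits += 1  # "significant": stop and declare an effect
                break
    return hits / sims


naive = false_positive_rate(1.96)    # nominal .05 test on every look
pocock = false_positive_rate(2.413)  # Pocock constant for 5 looks
print(f"uncorrected peeking: {naive:.3f}")
print(f"Pocock boundary:     {pocock:.3f}")
```

With uncorrected peeking the false-positive rate should land well above the nominal 5%, while the stricter boundary should hold it near 5%. The price is that each individual look has to clear a higher bar.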

Very little that I’m saying is new. These issues have been known for decades. And in fact, the 2 recommendations I listed above (determine sample size in advance, or properly account for interim testing in your analyses) are the same ones Tal made. So I could have saved some blog space and just said “please go read Armitage et al., Todd et al., or Yarkoni & Braver.” (UPDATE via a commenter: or Strube, 2006.) But blog space is cheap, and as far as I can tell word hasn’t gotten out very far. And with (now) readily available tools and growing concerns about replicability, it is time to put uncorrected data peeking to an end.

What counts as a successful or failed replication?

Let’s say that some theory states that people in psychological state A1 will engage in behavior B more than people in psychological state A2. Suppose that, a priori, the theory allows us to make this directional prediction, but not a prediction about the size of the effect.

A researcher designs an experiment — call this Study 1 — in which she manipulates A1 versus A2 and then measures B. Consistent with the theory, the result of Study 1 shows more of behavior B in condition A1 than A2. The effect size is d=0.8 (a large effect). A null hypothesis significance test shows that the effect is significantly different from zero, p<.05.

Now Researcher #2 comes along and conducts Study 2. The procedures of Study 2 copy Study 1 as closely as possible — the same manipulation of A, the same measure of B, etc. The result of Study 2 shows more of behavior B in condition A1 than in A2 — same direction as Study 1. In Study 2, the effect size is d=0.3 (a smallish effect). A null hypothesis significance test shows that the effect is significantly different from zero, p<.05. But a comparison of the Study 1 effect to the Study 2 effect (d=0.8 versus d=0.3) is also significant, p<.05.

Here’s the question: did Study 2 successfully replicate Study 1?

My answer is no. Here’s why. When we say “replication,” we should be talking about whether we can reproduce a result. A statistical comparison of Studies 1 and 2 shows that they gave us significantly different results. We should be bothered by the difference, and we should be trying to figure out why.
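For readers who want to see what such a comparison might look like, here is one possible sketch: a z-test on the difference between two independent Cohen’s d values, using the usual large-sample variance of d. The scenario above does not give sample sizes, so the ones below (50 per group in Study 1, 200 per group in Study 2) are hypothetical numbers I picked so that the example matches the scenario. There are other reasonable ways to make this comparison.

```python
from math import sqrt
from statistics import NormalDist


def d_variance(d, n1, n2):
    """Large-sample sampling variance of Cohen's d for two independent groups."""
    return (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))


def compare_effects(d_a, n_a, d_b, n_b):
    """z-test of the null that two studies share one true effect size.

    n_a and n_b are per-group sample sizes (equal groups assumed).
    """
    se_diff = sqrt(d_variance(d_a, n_a, n_a) + d_variance(d_b, n_b, n_b))
    z = (d_a - d_b) / se_diff
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Hypothetical sample sizes: 50 per group in Study 1, 200 per group in Study 2.
z, p = compare_effects(0.8, 50, 0.3, 200)
print(f"z = {z:.2f}, p = {p:.3f}")
```

Note that the null hypothesis here is “the two studies estimate the same effect,” not “the effect is zero.”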

People who would call Study 2 a “successful” replication of Study 1 are focused on what it means for the theory. The theoretical statement that inspired the first study only spoke about direction, and both results came out in the same direction. By that standard you could say that it replicated.

But I have two problems with defining replication in that way. My first problem is that, after learning the results of Study 1, we had grounds to refine the theory to include statements about the likely range of the effect’s size, not just its direction. Those refinements might be provisional, and they might be contingent on particular conditions (i.e., the experimental conditions under which Study 1 was conducted), but we can and should still make them. So Study 2 should have had a different hypothesis, a more focused one, than Study 1. Theories should be living things, changing every time they encounter new data. If we define replication as testing the theory twice then there can be no replication, because the theory is always changing.

My second problem is that we should always be putting theoretical statements to multiple tests. That should be such normal behavior in science that we shouldn’t dilute the term “replication” by including every possible way of doing it. As Michael Shermer once wrote, “Proof is derived through a convergence of evidence from numerous lines of inquiry — multiple, independent inductions all of which point to an unmistakable conclusion.” We should all be working toward that goal all the time.

This distinction — between empirical results vs. conclusions about theories — goes to the heart of the discussion about direct and conceptual replication. Direct replication means that you reproduce, as faithfully as possible, the procedures and conditions of the original study. So the focus should rightly be on the result. If you get a different result, it either means that despite your best efforts something important differed between the two studies, or that one of the results was an accident.

By contrast, when people say “conceptual replication” they mean that they have deliberately changed one or more major parts of the study — like different methods, different populations, etc. Theories are abstractions, and in a “conceptual replication” you are testing whether the abstract theoretical statement (in this case, B|A1 > B|A2) is still true under a novel concrete realization of the theory. That is important scientific work, but it differs in huge, qualitative ways from true replication. As I’ve said, it’s not just a difference in empirical procedures; it’s a difference in what kind of inferences you are trying to draw (inferences about a result vs. inferences about a theoretical statement). Describing those simply as 2 varieties of the same thing (2 kinds of replication) blurs this important distinction.

I think this means a few important things for how we think about replications:

1. When judging a replication study, the correct comparison is between the original result and the new one. Even if the original study ran a significance test against a null hypothesis of zero effect, that isn’t the test that matters for the replication. There are probably many ways of making this comparison, but within the NHST framework that is familiar to most psychologists, the proper “null hypothesis” to test against is the one that states that the two studies produced the same result.

2. When we observe a difference between a replication and an original study, we should treat that difference as a problem to be solved. Not (yet) as a conclusive statement about the validity of either study. Study 2 didn’t “fail to replicate” Study 1; rather, Studies 1 and 2 produced different results when they should have produced the same, and we now need to figure out what caused that difference.

3. “Conceptual replication” should depend on a foundation of true (“direct”) replicability, not substitute for it. The logic for this is very much like how validity is strengthened by reliability. It doesn’t inspire much confidence in a theory to say that it is supported by multiple lines of evidence if all of those lines, on their own, give results of poor or unknown consistency.

Paul Meehl on replication and significance testing

Still very relevant today.

A scientific study amounts essentially to a “recipe,” telling how to prepare the same kind of cake the recipe writer did. If other competent cooks can’t bake the same kind of cake following the recipe, then there is something wrong with the recipe as described by the first cook. If they can, then, the recipe is all right, and has probative value for the theory. It is hard to avoid the thrust of the claim: If I describe my study so that you can replicate my results, and enough of you do so, it doesn’t matter whether any of us did a significance test; whereas if I describe my study in such a way that the rest of you cannot duplicate my results, others will not believe me, or use my findings to corroborate or refute a theory, even if I did reach statistical significance. So if my work is replicable, the significance test is unnecessary; if my work is not replicable, the significance test is useless. I have never heard a satisfactory reply to that powerful argument.

Meehl, P. E. (1990). Appraising and amending theories: The strategy of Lakatosian defense and two principles that warrant using it. Psychological Inquiry, 1, 108-141, 173-180. [PDF]

Does psilocybin cause changes in personality? Maybe, but not so fast

This morning I came across a news article about a new study claiming that psilocybin (the active ingredient in hallucinogenic mushrooms) causes lasting changes in personality, specifically the Big Five factor of openness to experience.

It was hard to make out methodological details from the press report, so I looked up the journal article (gated). The study, by Katherine MacLean, Matthew Johnson, and Roland Griffiths, was published in the Journal of Psychopharmacology. When I read the abstract I got excited. Double blind! Experimentally manipulated! Damn, I thought, this looks a lot better than I thought it was going to be.

The results section was a little bit of a letdown.

Here’s the short version: Everybody came in for 2 to 5 sessions. In session 1 some people got psilocybin and some got a placebo (the placebo was methylphenidate, a.k.a. Ritalin; they also counted as “placebos” some people who got a very low dose of psilocybin in their first session). What the authors report is a significant increase in NEO Openness from pretest to after the last session. That analysis is based on the entire sample of N=52 (everybody got an active dose of psilocybin at least once before the study was over). In a separate analysis they report no significant change from pretest to after session 1 for the n=32 people who got the placebo first. So they are basing a causal inference on the difference between significant and not significant. D’oh!

To make it (even) worse, the “control” analysis had fewer subjects, hence less power, than the “treatment” analysis. So it’s possible that openness increased as much or even more in the placebo contrast as it did in the psilocybin contrast. (My hunch is that’s not what happened, but it’s not ruled out. They didn’t report the means.)
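To put rough numbers on the power difference, here is a back-of-the-envelope sketch. It assumes a simple pre/post test and a hypothetical standardized change of 0.4, which is a value I made up (the paper’s actual effect size is not used here), and it relies on a normal approximation.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(dz, n, alpha=0.05):
    """Approximate power of a two-sided pre/post test (normal approximation).

    dz is the standardized mean change. The tiny chance of rejecting in the
    wrong direction is ignored.
    """
    z = NormalDist()
    return z.cdf(dz * sqrt(n) - z.inv_cdf(1 - alpha / 2))


# Same hypothetical true change in both analyses; only the n differs.
for n in (52, 32):
    print(f"n = {n}: power is roughly {approx_power(0.4, n):.2f}")
```

Under these assumptions the same true change would be detected noticeably more often with 52 subjects than with 32, so one significant result and one non-significant result is not much of a surprise even if nothing differs between conditions.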

None of this means there is definitely no effect of psilocybin on Openness; it just means that the published paper doesn’t report an analysis that would answer that question. I hope the authors, or somebody else, come back with a better analysis. (A simple one would be a 2×2 ANOVA comparing pretest versus post-session-1 for the placebo-first versus psilocybin-first subjects. A slightly more involved analysis might involve a multilevel model that could take advantage of the fact that some subjects had multiple post-psilocybin measurements.)
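As a sketch of the simpler analysis: in a 2 (group) × 2 (time) mixed design, the interaction test is equivalent to an independent-samples t-test on the change scores (the interaction F is the square of that t). The data below are entirely made up for illustration, and the function name is my own. To get a p-value you would look the t up against a t distribution with the reported degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance


def interaction_t(pre_a, post_a, pre_b, post_b):
    """Pooled-variance t on change scores for two independent groups.

    In a 2 (group) x 2 (time) mixed design, the square of this t equals
    the F for the group-by-time interaction.
    """
    change_a = [post - pre for pre, post in zip(pre_a, post_a)]
    change_b = [post - pre for pre, post in zip(pre_b, post_b)]
    n_a, n_b = len(change_a), len(change_b)
    df = n_a + n_b - 2
    pooled_var = (
        (n_a - 1) * variance(change_a) + (n_b - 1) * variance(change_b)
    ) / df
    se = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(change_b) - mean(change_a)) / se, df


# Made-up Openness T-scores, purely for illustration.
placebo_pre, placebo_post = [60, 65, 62, 58, 66, 63], [61, 63, 62, 61, 65, 65]
psilo_pre, psilo_post = [61, 64, 59, 66, 62, 63], [65, 66, 64, 67, 68, 66]

t, df = interaction_t(placebo_pre, placebo_post, psilo_pre, psilo_post)
print(f"t({df}) = {t:.2f}")
```

This is the test that directly asks whether the change after psilocybin differs from the change after placebo, which is the question the causal claim depends on.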

Aside from the statistics, I had a few observations.

One thing you’d worry about with this kind of study – where the main DV is self-reported – is demand or expectancy effects on the part of subjects. I know it was double-blind, but they might have a good idea about whether they got psilocybin. My guess is that they have some pretty strong expectations about how shrooms are supposed to affect them. And these are people who volunteered to get dosed with psilocybin, so they probably had pretty positive expectations. I wouldn’t call the self-report issue a dealbreaker, but in a followup I’d love to see some corroborating data (like peer reports, ecological momentary assessments, or a structured behavioral observation of some kind).

On the other hand, they didn’t find changes in other personality traits. If the subjects had a broad expectation that psilocybin would make them better people, you would expect to see changes across the board. If their expectations were focused around Openness-related traits, that’s less relevant.

If you accept the validity of the measures, it’s also noteworthy that they didn’t get higher in neuroticism — which is not consistent with what the government tells you will happen if you take shrooms.

One of the most striking numbers in the paper is the baseline sample mean on NEO Openness — about 64. That is a T-score (normed [such as it is] to have a mean = 50, SD = 10). So that means that in comparison to the NEO norming sample, the average person in this sample was about 1.4 SDs above the mean — which is above the 90th percentile — in Openness. I find that to be a fascinating peek into who volunteers for a psilocybin study. (It does raise questions about generalizability though.)
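The arithmetic behind that percentile claim is quick to check. This sketch assumes the norms are normally distributed, which is an idealization.

```python
from statistics import NormalDist


def t_score_percentile(t_score, mean=50.0, sd=10.0):
    """Convert a T-score to a z-score and a percentile under a normal curve."""
    z = (t_score - mean) / sd
    return z, 100 * NormalDist().cdf(z)


z, pct = t_score_percentile(64)
print(f"z = {z:.1f}, percentile is about {pct:.0f}")
```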

Finally, because psilocybin was manipulated within subjects, the long-term (one year-ish) followup analysis did not have a control group. Everybody had been dosed. They predicted Openness at one year out based on the kinds of trip people reported (people who had a “complete mystical experience” also had the sustained increase in openness). For a much stronger inference, of course, you’d want to manipulate psilocybin between subjects.

The usability of statistics; or, what happens when you think that (p=.05) != (p=.06)

The difference between significant and not significant is not itself significant.

That is the title of a 2006 paper by statisticians Andrew Gelman and Hal Stern. It is also the theme of a new review article in Nature Neuroscience by Sander Nieuwenhuis, Birte U Forstmann, and Eric-Jan Wagenmakers (via Gelman’s blog). The review examined several hundred papers in behavioral, systems, and cognitive neuroscience. Of all the papers that tried to compare two effects, about half of them made this error instead of properly testing for an interaction.
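A small numerical illustration of the problem, using made-up numbers: one effect estimated at 25 and another at 10, each with a standard error of 10, and the two estimates assumed to be independent.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(estimate, se):
    """Two-sided p-value for a z-test of estimate against zero."""
    return 2 * (1 - NormalDist().cdf(abs(estimate / se)))


# Illustrative numbers only.
effect_a, se_a = 25.0, 10.0
effect_b, se_b = 10.0, 10.0

p_a = two_sided_p(effect_a, se_a)
p_b = two_sided_p(effect_b, se_b)

# The comparison that actually matters: the difference between the effects.
diff = effect_a - effect_b
se_diff = sqrt(se_a ** 2 + se_b ** 2)  # valid for independent estimates
p_diff = two_sided_p(diff, se_diff)

print(f"effect A: p = {p_a:.3f}")
print(f"effect B: p = {p_b:.3f}")
print(f"A minus B: p = {p_diff:.3f}")
```

The first effect clears p < .05 and the second does not, yet the test of their difference is nowhere near significant. Eyeballing which results got stars gives the wrong answer; the direct test of the difference (or the interaction) is what answers the question.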

I don’t know how often the error makes it through to published papers in social and personality psychology, but I see it pretty regularly as a reviewer. I call it when I see it; sometimes other reviewers call it out too, sometimes they don’t.

I can also remember making this error as a grad student – and my advisor correcting me on it. But the funny thing is, it’s not something I was taught. I’m quite sure that nowhere along the way did any of my teachers say you can compare two effects by seeing if one is significant and the other isn’t. I just started doing it on my own. (And now I sometimes channel my old advisor and correct my own students on the same error, and I’m sure nobody’s teaching it to them either.)

If I wasn’t taught to make this error, where was I getting it from? When we talk about whether researchers have biases, usually we think of hot-button issues like political bias. But I think this reflects a more straightforward kind of bias — old habits of thinking that we carry with us into our professional work. To someone without scientific training, it seems like you should be able to ask “Does X cause Y, yes or no?” and expect a straightforward answer. Scientific training teaches us a couple of things. First, the question is too simple: it’s not a yes or no question; the answer is always going to come with some uncertainty; etc. Second, the logic behind the tool that most of us use – null hypothesis significance testing (NHST) – does not even approximate the form of the question. (Roughly: “In a world where X has zero effect on Y, would we see a result this far from the truth less than 5% of the time?”)

So I think what happens is that when we are taught the abstract logic of what we are doing, it doesn’t really pervade our thinking until it’s been ground into us through repetition. For a period of time – maybe in some cases forever – we carry out the mechanics of what we have been taught to do (run an ANOVA) but we map it onto our old habits of thinking (“Does X cause Y, yes or no?”). And then we elaborate and extrapolate from them in ways that are entirely sensible by their own internal logic (“One ANOVA was significant and the other wasn’t, so X causes Y more than it causes Z, right?”).

One of the arguments you sometimes hear against NHST is that it doesn’t reflect the way researchers think. It’s a sort of usability argument: NHST is the butterfly ballot of statistical methods. In principle, I don’t think that argument carries the day on its own (if we need to use methods and models that don’t track our intuitions, we should). But it should be part of the discussion. And importantly, the Nieuwenhuis et al. review shows us how using unintuitive methods can have real consequences.