Is there p-hacking in a new breastfeeding study? And is disclosure enough?

There is a new study out, published in The Lancet Global Health, about the benefits of breastfeeding for eventual adult IQ. It’s getting lots of news coverage, for example from NPR, the BBC, the New York Times, and more.

A friend shared a link and asked what I thought of it. So I took a look at the article and came across this (emphasis added):

We based statistical comparisons between categories on tests of heterogeneity and linear trend, and we present the one with the lower p value. We used Stata 13·0 for the analyses. We did four sets of analyses to compare breastfeeding categories in terms of arithmetic means, geometric means, median income, and to exclude participants who were unemployed and therefore had no income.

Yikes. The description of the analyses is frankly a little telegraphic. But unless I’m misreading it, or they did some kind of statistical correction that they forgot to mention, it sounds like they had flexibility in the data analyses (I saw no mention of a pre-registered analysis plan), they used that flexibility to test multiple comparisons, and they’re openly disclosing that they used p-values for model selection – which is a more technical way of saying they engaged in p-hacking. (They don’t say how they selected among the four sets of analyses with different kinds of means etc.; was that based on p-values too?)*
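To see why presenting whichever test gives the lower p-value is a problem, here is a toy simulation (mine, not anything from the paper). Under the null hypothesis a well-behaved p-value is uniformly distributed between 0 and 1, so I just draw two p-values per simulated dataset and keep the smaller one. Treating the two tests as independent is a simplification (the heterogeneity and linear-trend tests are computed on the same data, so the real inflation would be somewhat smaller), but the basic point holds:

import random

random.seed(1)
n_sims, alpha = 100_000, 0.05
false_positives = 0
for _ in range(n_sims):
    # Under the null, each test's p-value is (approximately) Uniform(0, 1)
    p_heterogeneity = random.random()
    p_linear_trend = random.random()
    # "We present the one with the lower p value"
    if min(p_heterogeneity, p_linear_trend) < alpha:
        false_positives += 1

print(false_positives / n_sims)  # comes out around 0.0975, roughly double the nominal 0.05

And if the choice among the four sets of analyses was also guided by p-values, the inflation would be worse still.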

From time to time students ask, Am I allowed to do x statistical thing? And my standard answer is, in the privacy of your office/lab/coffeeshop/etc. you are allowed to do whatever you want! Exploratory data analysis is a good thing. Play with your data and learn from it.** But if you are going to publish the results of your exploration, then disclose. If you did something that could bias your p-values, let readers know and they can make an informed evaluation.***

But that advice assumes that you are talking to a sophisticated reader. When it comes time to talk to the public, via the press, you have a responsibility to explain yourself in plain terms: “We used a statistical approach that has an increased risk of producing false positives when there is no effect, or of overestimating the size of effects when they are real.”

And if that weakens your story too much, well, that’s valid. Your story is weaker. Scientific journals are where experts communicate with other experts, and it could still be interesting enough to publish for that audience, perhaps to motivate a more definitive followup study. But if it’s too weak to go to the public and tell mothers what to do with their bodies… Maybe save the press release for the pre-registered Study 2.

—–

* The study has other potential problems which are pretty much par for the course in these kinds of observational studies. They try to statistically adjust for differences between kids who were breastfed and those who weren’t, but that assumes that you have a complete and precisely measured set of all relevant covariates. Did they? It’s not a testable assumption, though it’s one that experts can make educated guesses at. On the plus side, when they added potentially confounding variables to the models the effects got stronger, not weaker. On the minus side, as Michelle Meyer pointed out on Twitter, they did not measure or adjust for parental IQ, which will definitely be associated with child IQ and for which the covariates they did use (like parental education and income) are only rough proxies.

** Though using p-values to guide your exploratory data analysis isn’t the greatest idea.

*** Some statisticians will no doubt disagree and say you shouldn’t be reporting p-values with known bias. My response is (a) if you want unbiased statistics then you shouldn’t be reading anything that’s gone through pre-publication review, and (b) that’s what got us into this mess in the first place. I’d rather make it acceptable for people to disclose everything, as opposed to creating an expectation and incentive for people to report impossibly clean results.

Some reflections on the Bargh-Doyen elderly walking priming brouhaha

Recently a controversy broke out over the replicability of a study John Bargh et al. published in 1996. The study reported that unconsciously priming a stereotype of elderly people caused subjects to walk more slowly. A recent replication attempt by Stephane Doyen et al., published in PLoS ONE, was unable to reproduce the results. (Less publicized, but surely relevant, is another non-replication by Hal Pashler et al.) Ed Yong wrote up an article about it in Discover, which last week drew a sharp response from Bargh.

The broader context is that there has been a large and ongoing discussion about replication in psychology (i.e., that there isn’t enough of it). I don’t have much to say about whether the elderly-walking effect is real. But this controversy has raised a number of issues about scientific discourse online as well as about how we think about replication.

The discussion has been unnecessarily inflammatory – on all sides. Bargh has drawn a lot of criticism for his response, which among other things included factual errors about PLoS ONE, suggestions that Doyen et al. were “incompetent or ill-informed,” and a claim that Yong was practicing irresponsible journalism. The PLoS ONE editors posted a strongly worded but civil response in the comments, and Yong has written a rebuttal. As for the scientific issue — is the elderly-priming effect real? — Daniel Simons has written an excellent post on the many, many reasons why an effect might fail to replicate. A failure to replicate does not need to impeach the honesty or scientific skills of either the original researcher or the replicator. It does not even mean the effect is not real. In an ideal world, Bargh should have treated the difference between his results and those of Doyen et al. as a puzzle to be worked out, not as a personal attack to be responded to in kind.

But… it’s not as though Bargh went bananas over a dispassionate report of a non-replication. Doyen et al. strongly suggested that Bargh et al.’s procedure had been contaminated by expectancy effects. Since expectancy effects are widely known in behavioral science (raise your hand if you have heard the phrase “double-blind”), the implication was that Bargh had been careless. And Ed Yong ran with that interpretation by leading off his original piece with the tale of Clever Hans. I don’t know whether Doyen or Yong meant to be inflammatory: I know nothing about Doyen; and in Yong’s case, based on his journalistic record, I doubt it (and he apparently gave Bargh plenty of opportunity to weigh in before his original post went live). But wherever you place the blame, a scientifically unfortunate result is that all of the other reasonable possibilities that Simons lists have been mostly ignored by the principals in this discussion.

Are priming effects hard to produce or easy? A number of priming researchers have suggested that priming effects are hard to get reliably. This doesn’t mean they aren’t important — experiments require isolation of the effect of interest, and the ease of isolating a phenomenon is not the same thing as its importance. (Those Higgs bosons are so hard to detect — so even if they exist they must not matter, right?) Bargh makes this point in his response too, suggesting that if Doyen et al. accidentally called subjects’ conscious attention to the elderly stereotype, that could wash out the effect (because conscious attention can easily interfere with automatic processes).

That being said… the effects in the original Bargh et al. report were big. Really big, by psychology standards. In experiment 2a, Bargh et al. report t(28) = 2.86, which corresponds to an effect size of d = 1.08. And in their replication, experiment 2b, they report t(28) = 2.16, which translates to d = 0.82. So even if we account for some shrinkage, under the right conditions it should not be hard for somebody to reproduce the elderly-walking priming effect in a new study.
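For anyone who wants to check those conversions, here is a quick sketch using the standard two-independent-groups formula d = 2t/√df, which I am assuming is the one behind the numbers above (it reproduces them exactly):

from math import sqrt

def t_to_d(t, df):
    # Cohen's d from an independent-groups t statistic: d = 2t / sqrt(df)
    return 2 * t / sqrt(df)

print(round(t_to_d(2.86, 28), 2))  # experiment 2a: 1.08
print(round(t_to_d(2.16, 28), 2))  # experiment 2b: 0.82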

The expectancy effects study is rhetorically powerful but proves little. In their Experiment 1, Doyen et al. tested the same hypothesis about priming stereotypes that Bargh tested. But in Experiment 2, Doyen et al. tested a hypothesis about experimenter expectancies. That is a completely different hypothesis. The second study tells us that experimenter expectancies can affect walking speed. But walking speed surely can be affected by more than one thing. So Experiment 2 does not tell us to what extent, if any at all, differences in walking speed were caused by experimenter expectancies in Bargh’s experiment (or for that matter, anywhere else in the natural world outside of Doyen’s lab). This is the inferential error of confusing causes of effects with effects of causes. Imagine that Doyen et al. had clubbed the subjects in the elderly-prime condition in the knee; most likely that would have slowed them down. But would we take that as evidence that Bargh et al. had done the same?

The inclusion of Experiment 2 served a strong rhetorical function, by planting in the audience’s mind the idea that the difference between Bargh’s original results and Doyen’s Experiment 1 was due to expectancy effects (and Ed Yong picked up and ran with this suggestion by referring to Clever Hans). But scientifically, all it shows is that expectancy effects can influence the dependent variable in the Bargh experiment. That’s not nothing, but anybody who already believes that experiments need to be double-blind should have seen that coming. If we had documentary evidence that in the actual 1996 studies Bargh et al. did not eliminate expectancy effects, that would be relevant. (We likely never will have such evidence; see the next point.) But Experiment 2 does not shed nearly as much light as it appears to.

We need more openness with methods and materials. When I started off in psychology, someone once told me that a scientific journal article should contain everything you need to reproduce the experiment (either directly or via references to other published materials). That, of course, is almost never true and maybe is unrealistic. Especially when you factor in things like lab skills, many of which are taught via direct apprenticeship rather than in writing, and which matter just as much in behavioral experiments as they do in more technology-heavy areas of science.

But with all that being said, I think we could do a lot better. A big part of the confusion in this controversy is over the details of methods — what exactly did Bargh et al. do in the original study, and how closely did Doyen et al. reproduce the procedure? The original Bargh et al. article followed the standards of its day in how much methodological detail it reported. Bargh later wrote a methods chapter that described more details of the priming technique (and which he claims Doyen et al. did not follow). But in this era of unlimited online supplements, there is no reason why, in future studies, all of the stimuli, instructions, and other materials could not be posted. That would enormously aid replication attempts.

What makes for a “failed” replication? This turns out to be a small point in the present context but an important one in a more general sense, so I couldn’t help but make it. We should be very careful about the language of “successful” and “failed” replications when it is based on the difference between p<.05 and p>.05. That is, just because the original study could reject the null and the replication could not, that doesn’t mean that the replication is significantly different from the original study. If you are going to say you failed to replicate the original result, you should conduct a test of that difference.

As far as I can tell, neither Doyen et al. nor Pashler et al. did that. So I did. I converted each study’s effect to an r effect size and then compared the studies with a z test of the difference between independent rs, and indeed Doyen et al. and Pashler et al. each differed significantly from Bargh’s original experiments. So this doesn’t alter the present discussion. But as good practice, the replication reports should have reported such tests.
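For readers who want to run this kind of comparison themselves, here is a sketch of the procedure: convert each t to an r effect size, apply Fisher’s r-to-z transformation, and test the difference between the independent rs. The Bargh value is the t(28) = 2.86 quoted above; the replication t and both sample sizes below are placeholders for illustration, not the actual Doyen or Pashler numbers:

from math import sqrt, atanh, erf

def t_to_r(t, df):
    # r effect size from a t statistic
    return t / sqrt(t ** 2 + df)

def compare_independent_rs(r1, n1, r2, n2):
    # z test of the difference between two independent correlations,
    # using Fisher's r-to-z transformation
    z1, z2 = atanh(r1), atanh(r2)
    se = sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / se
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-tailed p from the normal distribution
    return z, p

r_original = t_to_r(2.86, 28)     # Bargh et al., experiment 2a (two groups of 15)
r_replication = t_to_r(0.50, 28)  # placeholder replication result, same design
print(compare_independent_rs(r_original, 30, r_replication, 30))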

Journals can be groundbreaking or definitive, not both

I was recently invited to contribute to Personality and Social Psychology Connections, an online journal of commentary (read: fancy blog) run by SPSP. Don Forsyth is the editor, and the contributors include David Dunning, Harry Reis, Jennifer Crocker, Shige Oishi, Mark Leary, and Scott Allison. My inaugural post is titled “Groundbreaking or definitive? Journals need to pick one.” Excerpt:

Do our top journals need to rethink their missions of publishing research that is both groundbreaking and definitive? And as a part of that, do they — and we scientists — need to reconsider how we engage with the press and the public?…

In some key ways groundbreaking is the opposite of definitive. There is a lot of hard work to be done between scooping that first shovelful of dirt and completing a stable foundation. And the same goes for science (with the crucial difference that in science, you’re much more likely to discover along the way that you’ve started digging on a site that’s impossible to build on). “Definitive” means that there is a sufficient body of evidence to accept some conclusion with a high degree of confidence. And by the time that body of evidence builds up, the idea is no longer groundbreaking.

Read it here.

 

The brain scans, they do nothing

Breaking news: New brain scan reveals nothing at all.

‘This is an amazing discovery’, said leading neuroscientist Baroness Susan Greenfield, ‘the pictures tell us nothing about how the brain works, provide us with no insights into the nature of human consciousness, and all with such lovely colours.’ …

The development, which has been widely reported around the world, is also significant because it allows journalists to publish big fancy pictures of the brain that look really impressive while having little or no explanatory value.

I’ve previously mentioned the well-documented bias to think that brain pictures automatically make research more sciencey, even if the pictures are irrelevant to the conclusions. Satire makes that point a lot better, though.

In science, rejection is normal

In the news: A coupla guys played around with some #2 pencils and Scotch tape and won a Nobel Prize in physics. Talk about easy science! This is what happens when you work in a field with such low-hanging fruit that you run out of testable hypotheses.

Okay, kidding aside…

The initial NY Times report noted that the first paper on graphene that the researchers wrote was rejected by Nature before later being published in Science. [1]

It would be easy to fit that into a narrative that is common in movies and in science journalism: the brilliant iconoclasts rejected by the hidebound scientific establishment.

Far more likely though is a much more mundane explanation: scientists see their work rejected all the time. It’s just part of how science works. The review process is not perfect, and sometimes you have to shop a good idea around for a while before you can convince people of its merit. And the more productive you are, the more rejection experiences you will accumulate over a career.

It’s a good reminder: if you’re a working scientist (or trying to start a career as one), don’t get too worked up about rejection.

[1] Puzzling sidenote: That part no longer appears in the article on the NY Times website, but since there’s no correction statement I’ll still assume that it’s true and that it was simply edited out of a later edition for some reason. The rejection anecdote still appears on the PBS website.

Pretty pictures of brains are more convincing

This study seemed like it was begging to be done, so I figured somebody must have done it already. Thank you Google Scholar for helping me find it…

Seeing is believing: The effect of brain images on judgments of scientific reasoning [pdf]

David P. McCabe and Alan D. Castel

Brain images are believed to have a particularly persuasive influence on the public perception of research on cognition. Three experiments are reported showing that presenting brain images with articles summarizing cognitive neuroscience research resulted in higher ratings of scientific reasoning for arguments made in those articles, as compared to articles accompanied by bar graphs, a topographical map of brain activation, or no image. These data lend support to the notion that part of the fascination, and the credibility, of brain imaging research lies in the persuasive power of the actual brain images themselves. We argue that brain images are influential because they provide a physical basis for abstract cognitive processes, appealing to people’s affinity for reductionistic explanations of cognitive phenomena.

For a few years now I’ve been joking that I should end every talk with a slide of a random brain image, and conclude, “Aaaannnd… all of this happens in the brain!” This is solid evidence that doing so would help my credibility.

Now, the next big question is: who’s going to replicate this with psychologists and neuroscientists as the subjects?

Apparently I’m on a blogging break

I just noticed that I haven’t posted in over a month. Don’t fear, loyal readers (am I being presumptuous with that plural? hi Mom!). I haven’t abandoned the blog; apparently I’ve just been too busy or preoccupied to flesh out any coherent thoughts.

So instead, here are some things that, over the last month, I’ve thought about posting but haven’t summoned up the wherewithal to turn into anything long enough to be interesting:

  • Should psychology graduate students routinely learn R in addition to, or perhaps instead of, other statistics software? (I used to think SPSS or SAS was capable enough for the modal grad student and R was too much of a pain in the ass, but I’m starting to come around. Plus R is cheaper, which is generally good for graduate students.)
  • What should we do about gee-whiz science journalism covering social neuroscience that essentially reduces to, “Wow, can you believe that X happens in the brain?” (Still working on that one. Maybe it’s too deeply ingrained to do anything.)
  • Reasons why you should read my new commentary in Psychological Inquiry. (Though really, if it takes a blog post to explain why an article is worth reading, maybe the article isn’t worth reading. I suggest you read it and tell me.)
  • A call for proposals for what controversial, dangerous, or weird research I should conduct now that I just got tenure.
  • Is your university as sketchy as my university? (Okay, my university probably isn’t really all that sketchy. And based on the previous item, you know I’m not just saying that to cover my butt.)
  • My complicated reactions to the very thought-provoking Bullock et al. “mediation is hard” paper in JPSP.

Our spring term is almost over, so maybe I’ll get to one of these sometime soon.

On base rates and the “accuracy” of computerized Facebook gaydar

I never know what to make of reports stating the “accuracy” of some test or detection algorithm. Take this example, from a New York Times article by Steve Lohr titled How Privacy Vanishes Online:

In a class project at the Massachusetts Institute of Technology that received some attention last year, Carter Jernigan and Behram Mistree analyzed more than 4,000 Facebook profiles of students, including links to friends who said they were gay. The pair was able to predict, with 78 percent accuracy, whether a profile belonged to a gay male.

I have no idea what “78 percent accuracy” means in this context. The most obvious answer would seem to be that of all 4,000 profiles analyzed, 78% were correctly classified as gay versus not gay. But if that’s the case, I have an algorithm that beats the pants off of theirs. Are you ready for it?

Say that everybody is not gay.

Figure that around 5 to 10 percent of the population is gay. If these 4,000 students are representative of that, then saying not gay every time will yield an “accuracy” of 90-95%.

But wait — maybe by “accuracy” they mean what percentage of gay people are correctly identified as such. In that case, I have an algorithm that will be 100% accurate by that standard. Ready?

Say that everybody is gay.

You can see how silly this gets. To understand how good the test is, you need two numbers: sensitivity and specificity. My algorithms each turn out to be 100% on one and 0% on the other. Which means that they’re both crap. (A good test needs to be high on both.) I am hoping that the MIT class’s algorithm was a little better, and the useful numbers just didn’t get translated. But this news report tells us nothing that we need to know to evaluate it.
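Here is the arithmetic, assuming a made-up base rate of 8 percent (somewhere in that 5-10 percent range) and the 4,000 profiles from the article:

def evaluate(label_everyone_as, n=4000, base_rate=0.08):
    gay = round(n * base_rate)
    not_gay = n - gay
    if label_everyone_as == "not gay":
        tp, fn, tn, fp = 0, gay, not_gay, 0   # misses every gay profile
    else:
        tp, fn, tn, fp = gay, 0, 0, not_gay   # flags every profile as gay
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)   # share of gay profiles correctly identified
    specificity = tn / (tn + fp)   # share of non-gay profiles correctly identified
    return accuracy, sensitivity, specificity

print(evaluate("not gay"))  # (0.92, 0.0, 1.0): impressive "accuracy", useless test
print(evaluate("gay"))      # (0.08, 1.0, 0.0): perfect sensitivity, useless test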

Say it again

When students learn writing, they are often taught that if you have to say the same kind of thing more than once, you should word it a slightly different way each time. The idea is to add interest through variety.

But when I work with psychology students on their writing, I often have to work hard to break them of that habit. In scientific writing, precision and clarity matter most. This doesn’t mean that scientific writing cannot also be elegant and interesting (the vary-the-wording strategy is often just a cheap trick anyhow). But your first priority is to make sure that your reader knows exactly what you mean.

Problems arise when journalists trained in vary-the-wording write about statistics. Small thing, but take this sentence from a Slate piece (in the oft-enlightening Explainer column) about the Fort Hood shooting:

Studies have shown that the suicide rate among male doctors is 40 percent higher than among men overall and that female doctors take their own lives at 130 percent the rate of women in general.

The same comparison is being made for men and for women: how does the suicide rate among doctors compare to the general population? But the numbers are not presented in parallel. For men, the number presented is 40, as in “40 percent higher than” men in general. For women, the number is 130, as in “130 percent the rate of” women in general.

The prepositions are the tipoff that the writer is doing different things, and a careful reader can probably figure that out. But the attempt to add variety just bogs things down. A reader will have to slow down and possibly re-read once or twice to figure out that 40% and 130% are both telling us that doctors commit suicide more often than others.
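If you put both figures on the same footing, say as the ratio of the doctor rate to the general-population rate, the parallel becomes obvious:

# Both comparisons expressed as rate ratios (doctor rate / general-population rate)
male_doctor_ratio = 1 + 40 / 100   # "40 percent higher than" means 1.4 times the rate
female_doctor_ratio = 130 / 100    # "130 percent the rate of" means 1.3 times the rate
print(male_doctor_ratio, female_doctor_ratio)  # 1.4 vs. 1.3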

Separately: why break it out by gender? In context, the writer is trying to make a point about doctors versus everybody else. Not male doctors versus female doctors. We often reflexively categorize things by gender (I’m using “we” in a society-wide sense) when it’s unnecessary and uninformative.

Improving the grant system ain’t so easy

Today’s NY Times has an article by Gina Kolata about how the National Cancer Institute plays it safe with grant funding. The main point of the article is that NCI funds too many “safe” studies — studies that promise a high probability of making a modest, incremental discovery. This is done at the expense of more speculative and exploratory studies that take bigger risks but could lead to greater leaps in knowledge.

The article, and by and large the commenters on it, seem to assume that things would be better if the NCI funded more high-risk research. Missing is any analysis of what might be the downsides of adopting such a strategy.

By definition, a high-risk proposal has a lower probability of producing usable results. (That’s what people mean by “risk” in this context.) So for every big breakthrough, you’d be funding a larger number of dead ends. That raises three problems: a substantive policy problem, a practical problem, and a political problem.

1. The substantive problem is knowing what the net effect of changing the system would be. If you change the system so that you invest grant dollars in research that pays off half as often, but the findings are twice as valuable when it does pay off, it’s a wash — you haven’t made things better or worse overall (see the back-of-the-envelope example after this list). So it’s a problem of adjusting the system to optimize the risk × reward payoffs. I’m not saying the current situation is optimal; but nobody is presenting any serious analysis of whether an alternative investment strategy would be better.

2. The practical problem is that we would have to find some way to choose among high-risk studies. The problem everybody is pointing to is that in the current system, scientists have to present preliminary studies, stick to incremental variations on well-established paradigms, reassure grant panels that their proposal is going to pay off, etc. Suppose we move away from that… how would you choose amongst all the riskier proposals?

People like to point to historical breakthroughs that never would have been funded by a play-it-safe NCI. But it may be a mistake to believe those studies would have been funded by a take-a-risk NCI, because we have the benefit of hindsight and a great deal of forgetting. Before the research was carried out — i.e., at the time it would have been a grant proposal — every one of those would-be-breakthrough proposals would have looked just as promising as a dozen of their contemporaries that turned out to be dead-ends and are now lost to history. So it’s not at all clear that all of those breakthroughs would have been funded within a system that took bigger risks, because they would have been competing against an even larger pool of equally (un)promising high-risk ideas.

3. The political problem is that even if we could solve #1 and #2, we as a society would have to have the stomach for putting up with a lot of research that produces no meaningful results. The scientific community, politicians, and the general public would have to be willing to constantly remind themselves that scientific dead ends are not a “waste” of research dollars — they are the inevitable consequence of taking risks. There would surely be resistance, especially at the political level.
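To put a number on the “wash” in point 1, here is a back-of-the-envelope expected-value calculation with made-up figures for payoff probability and payoff value per funded grant:

# Expected payoff per funded grant, in arbitrary units of "value"
safe = 0.50 * 1.0    # pays off half the time, modest findings
risky = 0.25 * 2.0   # pays off half as often, findings twice as valuable when they arrive
print(safe, risky)   # 0.5 vs. 0.5: the same expected payoff per grant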

So what’s the solution? I’m sure there could be some improvements made within the current system, especially in getting review panels and program officers to reorient to higher-risk studies. But I think the bigger issue has to do with the overall amount of money available. As the top-rated commenter on Kolata’s article points out, the FY 2010 defense appropriation is more than 6 times what we have spent at NCI since Nixon declared a “war” on cancer 38 years ago. If you make resources scarce, of course you’re going to make people cautious about how they invest those resources. There’s a reason angel investors are invariably multimillionaires. If you want to inspire the scientific equivalent of angel investing, then the people giving out the money are going to have to feel like they’ve got enough money to take risks with.