p-hacking – The Hardest Science

Data analysis is thinking, data analysis is theorizing

Published on November 2, 2018 by Sanjay Srivastava

There is a popular adage about academic writing: “Writing is thinking.” The idea is this: There is a simple view of writing as just an expressive process – the ideas already exist in your mind and you are just putting them on a page. That may in fact be true some of the time, like for day-to-day emails or texting friends or whatever. But “writing is thinking” reminds us that for most scholarly work, the process of writing is a process of continued deep engagement with the world of ideas. Sometimes before you start writing you think you had it all worked out in your head, or maybe in a proposal or outline you wrote. But when you sit down to write the actual thing, you just cannot make it make sense without doing more intellectual heavy lifting. It’s not because you forgot, it’s not because you “just” can’t find the right words. It’s because you’ve got more thinking to do, and nothing other than sitting down and trying to write is going to show that to you.*

Something that is every bit as true, but less discussed and appreciated, is that in the quantitative sciences, the same applies to working with data. Data analysis is thinking. The ideas you had in your head, or the general strategy you wrote into your grant application or IRB protocol, are not enough. If that is all you have done so far, you almost always still have more thinking to do.

This point is exemplified really well in the recent Many Analysts, One Data Set paper. Twenty-nine teams of data analysts were given the same scientific hypothesis to test and the same dataset to test it in. But no two teams ran the same analysis, resulting in 29 different answers. This variability was neither statistical noise nor human error. Rather, the differences in results were because of different reasoned decisions by experienced data analysts. As the authors write in the introduction:

In the scientific process, creativity is mostly associated with the generation of testable hypotheses and the development of suitable research designs. Data analysis, on the other hand, is sometimes seen as the mechanical, unimaginative process of revealing results from a research study. Despite methodologists’ remonstrations…, it is easy to overlook the fact that results may depend on the chosen analytic strategy, which itself is imbued with theory, assumptions, and choice points.

The very end of the quote drives home a second, crucial point. Data analysis is thinking, but it is something else too. Data analysis is theorizing. And it is theorizing no matter how much or how little the analyst is thinking about it.

Scientific theory is not your mental state. Scientific theory consists of statements about nature. When, say, you decide on a scheme for how to exclude outliers in a response-time task, that decision implies a theory of which observations result from processes that are irrelevant to what you are studying and therefore ignorable. When you decide on how to transform variables for a regression, that decision implies a theory of the functional form of the relationships between measured variables. These theories may be longstanding ones, well-developed and deeply studied in the literature. Or they may be ad hoc, one-off implications. Moreover, the content of the analyst’s thinking may be framed in theoretical terms (“hmmm let me think through what’s generating this distribution”), or it may be shallow and rote (“this is how my old advisor said to trim response times”). But the analyst is still making decisions** that imply something about something in nature – the decisions are “imbued with theory.” That’s why scientists can invoke substantive reasons to critique each other’s analyses without probing each other’s mental states. “That exclusion threshold is too low, it excludes valid trials” is an admissible argument, and you don’t have to posit what was in the analyst’s head when you make it.
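To make the outlier example concrete, here is a minimal sketch in Python. The response times, the cutoff values, and the function names are all made up for illustration; nothing here comes from a real study. It only shows that each exclusion rule keeps a different set of trials, and so each one quietly asserts a different claim about which observations come from the process of interest.

```python
from statistics import mean, stdev

# Hypothetical response times in milliseconds (invented for illustration).
rts = [180, 420, 450, 470, 510, 530, 560, 610, 640, 700, 950, 2600]


def trim_absolute(values, low=200, high=2000):
    """Keep trials inside fixed bounds.

    Implied theory: anything faster than `low` or slower than `high`
    reflects a process other than the one being studied.
    """
    return [v for v in values if low <= v <= high]


def trim_sd(values, k=2.5):
    """Keep trials within k sample standard deviations of the sample mean.

    Implied theory: "irrelevant" is defined relative to this sample's
    own spread, not by absolute time limits.
    """
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) <= k * s]


rules = {
    "no exclusion": rts,
    "absolute cutoffs": trim_absolute(rts),
    "2.5 SD rule": trim_sd(rts),
}
for name, kept in rules.items():
    print(f"{name}: kept {len(kept)} of {len(rts)} trials, "
          f"mean = {mean(kept):.1f} ms")
```

In this toy dataset the absolute cutoffs drop both the very fast and the very slow trial, while the SD rule drops only the slow one. Neither is "the" correct rule; the point is that picking one is a substantive claim that someone else can argue with.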

So data analysis decisions imply statements in theory-space, and in order to think well about them we probably need to think in that space too. To test one theory of interest, the process of data analysis will unavoidably invoke other theories. This idea is not, in fact, new. It is a longstanding and well-accepted principle in philosophy of science called the Duhem-Quine thesis. We just need to recognize that data analysis is part of that web of theories.

This gives us an expanded framework to understand phenomena like p-hacking. The philosopher Imre Lakatos said that if you make a habit of blaming the auxiliaries when your results don’t support your main theory, you are in what he called a degenerative research programme. As you might guess from the name, Imre wasn’t a fan. When we p-hack – try different analysis specifications until we get one we like – we are trying and discarding different configurations of auxiliary theories until we find one that lets us draw a preferred conclusion. We are doing degenerative science, maybe without even realizing it.

On the flip side, this is why preregistration can be a deeply intellectually engaging and rewarding process.*** Because without the data whispering in your ear, “Try it this way and if you get an asterisk we can go home,” you have one less shortcut around thinking about your analysis. You can, of course, leave the thinking until later. You can do so with full awareness and transparency: “This is an exploratory study, and we plan to analyze the data interactively after it is collected.” Or you can fool yourself, and maybe others, if you write a vague or partial preregistration. But if you commit to planning your whole data analysis workflow in advance, you will have nothing but thinking and theorizing to guide you through it. Which, sooner or later, is what you’re going to have to be doing.

* Or, you can write it anyway and not make sense, which also has a parallel in data analysis.
** Or outsourcing them to the software developer who decided what defaults to put in place.
*** I initially dragged my heels on starting to preregister – I know, I know – but when I finally started doing it with my lab, we experienced this for ourselves, somewhat to my own surprise.

What if we talked about p-hacking the way we talk about experimenter effects?

Published on October 19, 2018 by Sanjay Srivastava

Discussions about p-hacking sometimes go sideways. A hypothetical exchange might go like this:

READER: Those p-values are all hovering just below .05, I bet the authors p-hacked.

AUTHOR: I know that I did not p-hack, and I resent the accusation.

By comparison, consider how we talk about another form of potential bias: experimenter effects.

It is widely accepted that experimenters’ expectations, beliefs, or other characteristics can influence participants in behavioral experiments and medical trials. We also accept that this can happen without intent or even awareness on the part of the experimenter. Expectations about how participants receiving a treatment are supposed to differ from those receiving a placebo might show up in the experimenter’s behavior in subtle ways that could influence the participants.

We also don’t have a complete theory of experimenter effects that allows us to reliably measure every manifestation or predict with high confidence when they will and won’t occur. So instead, we consider them as an open possibility in a wide range of situations. As a result, it is also widely accepted that using procedural safeguards against experimenter effects is a best practice in most experiments where a human experimenter will interact with subjects.

Because of all these shared assumptions, discussions around experimenter effects are often much less heated. If you are presenting a study design at lab meeting, and someone says “you’ll need to keep your RAs blind to condition, here’s an idea how to do that…” that’s generally considered a helpful suggestion rather than an insinuation of planned malfeasance.

And even after a study is done, it is generally considered fair game to ask about blinding and other safeguards, and incorporate their presence or absence into an evaluation of a study. If a study lacks such safeguards, authors generally don’t say things like “I would never stoop so low as to try to influence my participants, how dare you!” Everybody, including authors, understands that experimenters don’t always know how they might be influencing subjects. And when safeguards are missing, readers typically treat it as a reason for doubt and uncertainty. We allow and even expect readers to calibrate that uncertainty judgment based on other assumptions or information, like how plausible the effect seems, how strong or weak any partial or incomplete safeguards seem, etc.

For some reason though, when it comes to potential sources of bias in data analysis, we have not (yet) reached a place where we can talk about it in a similar way. This is despite the fact that it has a lot in common with experimenter effects.

It is certainly possible for somebody to deliberately and strategically p-hack, just like it’s possible for an experimenter to wink and nudge and say “are you sure you’re not feeling better?” or whatever. But bias in data analysis does not have to happen that way. Analysts do not have to have intention or even awareness in order to do things that capitalize on chance.

Consider, first of all, that almost every data analysis involves many decisions: what data to include or exclude, whether or how to transform it, a zillion possibilities in specifying the analysis (what particular variables to look at, what analyses to run on them, whether to use one- or two-tailed tests, what covariates to include, which main, interactive, simple, or contrast effect[s] to treat as critical tests of the hypothesis, etc.), and then decisions about what to report. We psychologists of all people know that you cannot un-know something. So once the analyst has seen anything about the data – distributions, scatterplots, preliminary or interim analyses, whatever else – all the subsequent decisions will be made by a person who has that knowledge. And after that point, it is simply impossible for anybody – including the analyst – to state with any confidence how those decisions might otherwise have been made without that knowledge. Which means that we have to treat seriously the possibility that the analyst made decisions that overfit the analyses to the data.

More subtly, as Gelman and Loken discuss in their “forking paths” paper, bias is not defined by a behavior (how many analyses did you run?), but by a set of counterfactuals (how many analyses could you have run?). So even if the objective history is that one and only one analysis was run, that is not a guarantee of no bias.
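A small simulation can illustrate the counterfactual point. This is a sketch under assumptions I am making up for the purpose: a two-group study with two subgroups (labeled A and B), no true effect anywhere, and a z-test with a known population SD of 1 so that the p-value can be computed with the standard library alone. The "forked" analyst computes exactly one test per study, but chooses which subgroup to test after glancing at the descriptive statistics.

```python
import random
from statistics import NormalDist, fmean


def z_test_p(a, b, sd=1.0):
    """Two-sided p-value for a difference in means (population SD assumed known)."""
    se = sd * (1 / len(a) + 1 / len(b)) ** 0.5
    z = (fmean(a) - fmean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def one_study(rng, n=25):
    """Simulate one study in which the null hypothesis is true everywhere."""
    data = {
        subgroup: ([rng.gauss(0, 1) for _ in range(n)],   # treatment
                   [rng.gauss(0, 1) for _ in range(n)])   # control
        for subgroup in ("A", "B")
    }
    # Planned path: subgroup A is tested no matter what the data look like.
    p_planned = z_test_p(*data["A"])
    # Forking path: look at the means first, then run a single test on
    # whichever subgroup shows the bigger raw difference.
    chosen = max(data, key=lambda g: abs(fmean(data[g][0]) - fmean(data[g][1])))
    p_forked = z_test_p(*data[chosen])
    return p_planned, p_forked


def false_positive_rates(n_sims=4000, alpha=0.05, seed=1):
    """Share of null studies that reach p < alpha under each path."""
    rng = random.Random(seed)
    results = [one_study(rng) for _ in range(n_sims)]
    planned = sum(p < alpha for p, _ in results) / n_sims
    forked = sum(p < alpha for _, p in results) / n_sims
    return planned, forked


planned, forked = false_positive_rates()
print(f"planned single test: {planned:.3f}")
print(f"data-contingent single test: {forked:.3f}")
```

Under these assumptions the planned test should come out near the nominal 5%, while the data-contingent test should come out near 1 − 0.95² ≈ 9.75%, because choosing the subgroup with the bigger difference is equivalent to taking the smaller of two independent p-values. The analyst in the second path can truthfully say that only one test was run.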

What all of this means is that when it comes to bias in data analysis, we are in very much a similar situation as with experimenter effects. It is virtually impossible to measure or observe it happening in a single instance, even by the person doing the data analysis. But what we can do is define a broad set of circumstances where we have to take it seriously as a possibility.

It would be great if we could collectively shift our conversations around this issue. I think that would involve changes from both critical readers and from authors.

Start by considering procedures, not behavior or outcomes. Were safeguards in place, and if so, how effective were they? For bias in data analysis, the most common safeguard is preregistration. The mere existence of a preregistration (as indicated by a badge or an OSF link in a manuscript) tells you very little though – many of them do not actually constrain bias. Sometimes that is even by design (for example, preregistering an exploratory study is a great way to prevent editors or reviewers from pressuring you to HARK later on). A preregistration is just a transparency step; you have to actually read it to find out what it does. In order for a preregistration to prevent analytic bias, it has to do two things. First, it has to have a decision inventory – that is, it has to identify all of the decisions about what data to collect/analyze, how to analyze it, and what to report. So ask yourself: is there a section on exclusions? Transformations? Does it say what the critical test is? Etc. (This will be easier to do in domains where you are familiar with the analytic workflow for the research area. It can also be aided by consulting templates. And if authors write and post analysis code as part of a preregistration, that can make things clear too.) Second, the preregistration has to have a plan for all of those decision points. To the extent that the inventory is complete and the plans are specific and were determined separately from the data, the preregistration can be an effective safeguard against bias.

When safeguards are missing or incomplete, everyone – authors and readers alike – should treat analytic bias as a serious possibility. If there is no preregistration or other safeguards, then bias is possible. If there is a preregistration but it was vague or incomplete, bias is also possible. In a single instance it is often impossible to know what actually happened, for the reasons I discussed above. It can be reasonable to start looking at indirect evidence, like statistical patterns (such as the distribution of p-values) or whether the result is a priori implausible. Inferences about these things should be made with calibrated uncertainty. p-curves are neither perfect nor useless; improbable things really do happen, though by definition rarely; etc. So usually we should not be too sure in any direction.

Inferences about authors should be rare. We should have a low bar for talking about science and a high bar for talking about scientists. This cuts both ways. Casual talk challenging authors’ competence, intentions, unreported behaviors, etc. is often both hurtful and unjustified when we are talking about single papers.* But also, authors’ positive assertions about their character, behavior, etc. rarely shed light and can have the perverse effect of reinforcing the message that they, and not just the work, are a legitimate part of the conversation. As much as possible, make all the nouns in your discussion things like “the results,” “the procedure,” etc. and not “the authors” (or for that matter “my critics”). And whether you are an author, a critic, or even an observer, you can point out when people are talking about authors and redirect the conversation to the work.

I realize this last item draws a razor-thin line and maybe sometimes it is no line at all. After all, things like what safeguards were in place, and what happened if they weren’t, are results of the researcher’s behavior. So even valid criticism implicates what the authors did or didn’t do, and it will likely be personally uncomfortable for them. But it’s a distinction that’s worth observing as much as you can when you criticize work or respond to criticisms. And I would hope we’ve learned from the ways we talk about experimenter effects that it is possible to have less heated, and frankly more substantive, discussions about bias when we do that.

Finally, it is worth pointing out that preregistration and other safeguards are still really new to psychology and many other scientific fields. We are all still learning, collectively, how to do them well. That means that we need to be able to criticize them openly, publicly, and vigorously – if we do not talk about them, we cannot get better at doing them. But it also means that some preregistration is almost always better than none, because even a flawed or incomplete one will increase transparency and make it possible to criticize work more effectively. Even as we critique preregistrations that could have been done better, we should recognize that anybody who makes that critique and improvement possible has done something of value.

* In the bigger picture, for better or worse, science pins career advancement, resources, prestige, etc. to people’s reputations. So at some point we have to be able to talk about these things. This is a difficult topic and not something I want to get into here, other than to say that discussions about who is a good scientist are probably better left to entirely separate conversations from ones where we scientifically evaluate single papers, because the evidentiary standards and consequences are so different.

Is there p-hacking in a new breastfeeding study? And is disclosure enough?

Published on March 18, 2015 by Sanjay Srivastava

There is a new study out about the benefits of breastfeeding on eventual adult IQ, published in The Lancet Global Health. It’s getting lots of news coverage, for example in NPR, BBC, New York Times, and more.

A friend shared a link and asked what I thought of it. So I took a look at the article and came across this (emphasis added):

We based statistical comparisons between categories on tests of heterogeneity and linear trend, and we present the one with the lower p value. We used Stata 13·0 for the analyses. We did four sets of analyses to compare breastfeeding categories in terms of arithmetic means, geometric means, median income, and to exclude participants who were unemployed and therefore had no income.

Yikes. The description of the analyses is frankly a little telegraphic. But unless I’m misreading it, or they did some kind of statistical correction that they forgot to mention, it sounds like they had flexibility in the data analyses (I saw no mention of pre-registration in the analysis plan), they used that flexibility to test multiple comparisons, and they’re openly disclosing that they used p-values for model selection – which is a more technical way of saying they engaged in p-hacking. (They don’t say how they selected among the 4 sets of analyses with different kinds of means etc.; was that based on p-values too?)*
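To see why "present the one with the lower p value" matters, here is a rough simulation sketch. It is not a reconstruction of the paper's analysis: I am assuming three ordered categories, equal group sizes, normally distributed outcomes with a known SD of 1, and no true differences at all. Those simplifications are mine, chosen so both p-values can be computed with the Python standard library. The function names are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean


def heterogeneity_p(groups, sd=1.0):
    """Chi-square test that three group means are equal (known SD, equal n).

    With 2 degrees of freedom the upper-tail probability is exp(-Q / 2).
    """
    n = len(groups[0])
    means = [fmean(g) for g in groups]
    grand = fmean(means)
    q = n * sum((m - grand) ** 2 for m in means) / sd ** 2
    return math.exp(-q / 2)


def trend_p(groups, sd=1.0):
    """z-test of the linear contrast (-1, 0, +1) across the ordered categories."""
    n = len(groups[0])
    z = (fmean(groups[2]) - fmean(groups[0])) / (sd * math.sqrt(2 / n))
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate(n_sims=10000, n=30, alpha=0.05, seed=2015):
    """False-positive rate of each test, and of reporting whichever p is lower."""
    rng = random.Random(seed)
    hits = {"heterogeneity": 0, "trend": 0, "lower of the two": 0}
    for _ in range(n_sims):
        # All three categories are drawn from the same distribution (null is true).
        groups = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(3)]
        p_het, p_trend = heterogeneity_p(groups), trend_p(groups)
        hits["heterogeneity"] += p_het < alpha
        hits["trend"] += p_trend < alpha
        hits["lower of the two"] += min(p_het, p_trend) < alpha
    return {name: count / n_sims for name, count in hits.items()}


rates = simulate()
for name, rate in rates.items():
    print(f"{name}: {rate:.3f}")
```

Each test on its own should reject close to 5% of the time under these assumptions. Reporting the lower of the two p-values should reject more often than that, because a dataset only has to clear one of two bars. The inflation in this toy setup is modest since the two tests are correlated, but it is there, and it compounds with any other flexibility in the analysis.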

From time to time students ask, Am I allowed to do x statistical thing? And my standard answer is, in the privacy of your office/lab/coffeeshop/etc. you are allowed to do whatever you want! Exploratory data analysis is a good thing. Play with your data and learn from it.** But if you are going to publish the results of your exploration, then disclose. If you did something that could bias your p-values, let readers know and they can make an informed evaluation.***

But that advice assumes that you are talking to a sophisticated reader. When it comes time to talk to the public, via the press, you have a responsibility to explain yourself. “We used a statistical approach that has an increased risk of producing false positives when there is no effect, or overestimating the size of effects when they are real.”

And if that weakens your story too much, well, that’s valid. Your story is weaker. Scientific journals are where experts communicate with other experts, and it could still be interesting enough to publish for that audience, perhaps to motivate a more definitive followup study. But if it’s too weak to go to the public and tell mothers what to do with their bodies… Maybe save the press release for the pre-registered Study 2.

—–

* The study has other potential problems which are pretty much par for the course in these kinds of observational studies. They try to statistically adjust for differences between kids who were breastfed and those who weren’t, but that assumes that you have a complete and precisely measured set of all relevant covariates. Did they? It’s not a testable assumption, though it’s one that experts can make educated guesses at. On the plus side, when they added potentially confounding variables to the models the effects got stronger, not weaker. On the minus side, as Michelle Meyer pointed out on Twitter, they did not measure or adjust for parental IQ, which will definitely be associated with child IQ and for which the covariates they did use (like parental education and income) are only rough proxies.

** Though using p-values to guide your exploratory data analysis isn’t the greatest idea.

*** Some statisticians will no doubt disagree and say you shouldn’t be reporting p-values with known bias. My response is (a) if you want unbiased statistics then you shouldn’t be reading anything that’s gone through pre-publication review, and (b) that’s what got us into this mess in the first place. I’d rather make it acceptable for people to disclose everything, as opposed to creating an expectation and incentive for people to report impossibly clean results.
