How do you think about statistical methods in science? Are statistics a matter of math and logic? Or are they a useful tool? Over time, I have noticed that these seem to be two implicit frames for thinking about statistics. Both are useful, but they tend to be more common in different research communities. And I think sometimes conversations get off track when people are using different ones.
Frame 1 is statistics as math and logic. I think many statisticians and quantitative psychologists work under this frame. Their goal is to understand statistical methods, and statistics are based on math and logic. In math and logic, things are absolute and provable. (Even in statistics, which deals with uncertainty, the uncertainty is almost always quantifiable, and thus subject to analysis.) In math and logic, exceptions and boundary cases are important. If I say “All A are B” and you disagree with me, all you need to do is show me one instance of an A that is not B and you’re done.
In the realm of statistics, that can mean either proving or demonstrating that a method breaks down under some conditions. A good example of this is E. J. Wagenmakers et al.’s recent demonstration that using intervals to do hypothesis testing is wrong. Many people (including me) have assumed that if the 95% confidence interval of a parameter excludes 0, that’s the same as falsifying the hypothesis “parameter = 0.” E. J. and colleagues show an instance where this isn’t true — that is, where the data are uninformative about a hypothesis, but the interval would lead you to believe you had evidence against it. In the example, a researcher is testing a hypothesis about a binomial probability and has a single observation. So the demonstrated breakdown occurs at N = 1, which is theoretically interesting but not a common scenario in real-world research applications.
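For readers who want the flavor of that kind of breakdown, here is a minimal sketch along the same lines – my own toy setup, not E. J. and colleagues’ exact analysis. Suppose the hypothesis is that a binomial probability equals .5, the single observation is one success, the interval is a 50% Wilson interval, and the Bayesian comparison uses a uniform prior (all of those choices are mine, for illustration). The interval excludes .5, yet the Bayes factor is exactly 1: the lone observation carries no evidence either way.

```python
# Minimal sketch (illustrative choices, not the published example): one Bernoulli
# trial, one success. An interval can exclude the hypothesized value even though
# the single observation carries no evidence about it.
import numpy as np
from scipy import stats

x, n = 1, 1        # a single observation: one success in one trial
theta0 = 0.5       # hypothesized binomial probability (my illustrative null)

# 50% Wilson score interval (the interval type and confidence level are my choices)
z = stats.norm.ppf(1 - 0.50 / 2)
p_hat = x / n
center = (p_hat + z**2 / (2 * n)) / (1 + z**2 / n)
half = (z / (1 + z**2 / n)) * np.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
print(f"50% Wilson interval: [{center - half:.3f}, {center + half:.3f}]")  # ~[0.69, 1.00]: excludes 0.5

# Bayes factor: point null theta = 0.5 vs. a uniform prior on theta
p_data_null = stats.binom.pmf(x, n, theta0)   # = 0.5
p_data_alt = 1.0 / (n + 1)                    # marginal likelihood under a uniform prior = 0.5
print("BF01 =", p_data_null / p_data_alt)     # = 1.0: the datum is completely uninformative
```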
Frame 2 is statistics as a tool. I think many scientists work under this frame. The scientist’s goal is to understand the natural world, and statistics are a tool that you use as part of the research process. Scientists are pragmatic about tools. None of our tools are perfect – lab equipment generates noisy observations and can break down, questionnaires are only good for some populations, etc. Better tools are better, of course, but since they’re never perfect, at some point we have to decide they’re good enough so we can get out and use them.
Viewing statistics as a tool means that you care whether or not something works well enough under the conditions in which you are likely to use it. A good example of a tool-frame analysis of statistics is Judd, Westfall, and Kenny’s demonstration that traditional repeated-measures ANOVA fails to account for the sampling of stimuli in many within-subjects designs, and that multilevel modeling with random effects is necessary to correctly model those effects. Judd et al. demonstrate this with data from their own published studies, showing in some cases that they themselves would have (and should have) reached different scientific conclusions.
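To get a feel for how much it can matter, here is a small simulation sketch of the underlying problem – with made-up parameter values of my own choosing, not Judd et al.’s data. Under a true null, when each condition gets its own sample of stimuli and the analysis averages over stimuli and treats them as fixed (a paired t-test standing in for the two-condition repeated-measures ANOVA), the false-positive rate ends up far above the nominal 5%.

```python
# Sketch: the "stimuli treated as fixed" problem, simulated under a true null.
# Parameter values are arbitrary; the point is the inflated false-positive rate.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def one_study(n_subj=30, n_stim=10, sd_subj=1.0, sd_stim=0.5, sd_err=1.0):
    subj = rng.normal(0, sd_subj, n_subj)       # random subject intercepts
    stim_a = rng.normal(0, sd_stim, n_stim)     # stimuli sampled for condition A
    stim_b = rng.normal(0, sd_stim, n_stim)     # a different stimulus sample for condition B
    # trial-level responses (subjects x stimuli); no true condition effect
    y_a = subj[:, None] + stim_a[None, :] + rng.normal(0, sd_err, (n_subj, n_stim))
    y_b = subj[:, None] + stim_b[None, :] + rng.normal(0, sd_err, (n_subj, n_stim))
    # conventional "by-subjects" analysis: average over stimuli, paired t-test
    t, p = stats.ttest_rel(y_a.mean(axis=1), y_b.mean(axis=1))
    return p

pvals = np.array([one_study() for _ in range(2000)])
print("False-positive rate at alpha = .05:", (pvals < .05).mean())  # far above .05
```

A model with crossed random effects for subjects and stimuli – the kind of multilevel model Judd et al. advocate – is what brings the error rate back to its nominal level; the sketch above only shows the failure mode.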
I suspect that this difference in frames relates to the communications gap that Donald Sharpe identified between statisticians and methodologists. (Sharpe’s paper is well worth a read, whether you’re a statistician or a scientist.) Statistical discoveries and innovations often die a lonely death in Psychological Methods because quants prove something under Frame 1 but do not go the next step of demonstrating that it matters under Frame 2, so scientists don’t adopt it. (To be clear, I don’t think the stats people always stay in Frame 1 – as Sharpe points out, some of the field’s most heavily cited papers are in Psychological Methods too, and many of those are ones that speak to both frames.)
I also wonder if this might contribute to the prevalence of less-than-optimal research practices (LTORPs, which include the things sometimes labeled p-hacking or questionable research practices / QRPs). I’m sure plenty of scientists really have (had?) no idea that flexible stopping rules, trying analyses with and without a covariate to see which way works better, etc. are a problem. But I bet others have some general sense that LTORPs are not theoretically correct, perhaps because their first-year grad stats instructors told them so (probably in a Frame 1-ey way). But they also know — perhaps because they have been told by the very same statistics instructors — that there are plenty of statistical practices that are technically wrong but not a big deal (e.g., some departures from distributional assumptions). Tools don’t have to be perfect; they just have to work for the problem at hand. Specifically, I suspect that for a long time, many scientists’ attitude has been that p-values do not have to be theoretically correct; they just have to lead people to make enough right decisions enough of the time. Take them seriously, but not that seriously. So when faced with a situation they haven’t been taught the exact tools for, scientists will weigh the problem as best they can, and sometimes they tell themselves — rightly or wrongly — that what they’re doing is good enough, and they do it.
Sharpe makes excellent points about why there is a communication gap and what to do about it. I hope the 2 frames notion complements that. Scientists have to make progress with limited resources, which means they are constantly making implicit (and sometimes explicit) cost-benefit calculations. If adopting a statistical innovation will require time up front to learn it and perhaps additional time each time you implement it, researchers will ask themselves if it is worth it. Of all the things I need or want to do — writing grants, writing papers, training grad students, running experiments, publishing papers — how much less of the other stuff will I be able to do if I put the time and resources into this? Will this help me do better at my goal of discovering things about the natural world (which is different than the goal of the statistician, which is to figure out new things about statistics), or is this just a lot of headache that’ll lead me to mostly the same place that the simpler way would?
I have a couple of suggestions for better dialogue and progress on both sides. One is that we need to recognize that the 2 frames come from different sets of goals – statisticians want to understand statistics, scientists want to understand the natural world. Statisticians should go beyond showing that something is provably wrong or right and address whether it is consequentially wrong or right. One person’s provable error is another person’s reasonable approximation. And scientists should consult statisticians about the real-world consequences of their decisions. As much as possible, don’t assume something is good enough; verify it.
Scientists also need usable tools to solve their problems – both conceptual tools and more concrete things like software and procedures. So scientists and statisticians need to talk more. I think data peeking is a good example of this. To a statistician, setting sample size a priori probably seems like a decent assumption. To a scientist who has just spent two years and six figures of grant money on a study and arrived at suggestive but not conclusive results (a.k.a. p = .11), it is laughable to suggest setting aside that dataset and starting from scratch with a larger sample. If you believe your only options are to do that or to run another few subjects and redo the analysis, and you have told yourself the latter is just a minor fudge (“good enough”), you’re probably going to run the subjects. Sequential analyses solve that problem. They have been around for a while but have mostly languished in a small corner of the clinical trials literature, where there was a pressing ethical reason to use them. Now that scientists are realizing they exist and can solve a wider range of problems, sequential analyses are starting to get much wider attention. They probably should be integrated even more into the data-analytic frameworks (and software) for expensive research areas, like fMRI.
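To see both the problem and how cheaply the sequential fix buys it back, here is a rough simulation sketch with made-up numbers of my own – five equally spaced looks, a one-sample z-test, and a Pocock-style constant boundary of roughly 2.41 for five looks (the exact value depends on the design). Peeking at a nominal p < .05 after every batch inflates the false-positive rate to roughly .14; the stricter per-look boundary keeps it near .05.

```python
# Sketch: repeated looks at accumulating data when the true effect is zero.
# Naive rule: stop as soon as the running z-test crosses 1.96 (nominal p < .05).
# Sequential rule: a Pocock-style constant boundary, roughly z = 2.41 for 5 looks.
import numpy as np

rng = np.random.default_rng(1)
looks = [20, 40, 60, 80, 100]    # peek after every 20 observations (illustrative)

def study_rejects(boundary):
    data = rng.normal(0, 1, looks[-1])            # null is true: mean 0, SD 1
    for n in looks:
        z = data[:n].mean() * np.sqrt(n)          # one-sample z-test with known SD
        if abs(z) > boundary:
            return True                           # stop early and declare "significant"
    return False

n_sim = 5000
naive = np.mean([study_rejects(1.96) for _ in range(n_sim)])
pocock = np.mean([study_rejects(2.41) for _ in range(n_sim)])
print(f"False-positive rate, peeking at |z| > 1.96: {naive:.3f}")   # roughly .14
print(f"False-positive rate, boundary of 2.41:      {pocock:.3f}")  # close to .05
```

The point is that planning the looks in advance, with adjusted boundaries, lets a researcher build “run a few more subjects and look again” into the design instead of treating it as a fudge.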
Sharpe encourages statisticians to pick real examples. Let me add that they should be examples of research that you are motivated to help. Theorems, simulations, and toy examples are Frame 1 tools. Analyses in real data will hit home with scientists where they live in Frame 2. Picking apart a study in an area you already have distaste for (“I think evolutionary psych is garbage, let me choose this ev psych study to illustrate this statistical problem”) might feel satisfying, but it probably leads to less thorough and less persuasive critiques. Show in real data how the new method helps scientists with their goals, not just what the old one gets wrong according to yours.
I think of myself as one of what Sharpe calls the Mavens – scientists with an extra interest in statistics, who nerd out on quant stuff, often teach quantitative classes within their scientific field, and who like to adopt and spread the word about new innovations. Sometimes Mavens get into something because it just seems interesting. But often they (we) are drawn to things that seem like cool new ways to solve problems in our field. Speaking as a Maven who thinks statisticians can help us make science better, I would love it if you could help us out. We are interested, and we probably want to help too.
Sanjay,
As a fellow Maven with Statistical Methods as my minor in my Ph.D. (in Management) and most of my doctoral work in both areas in the Psychology Department, I’ve long advocated a shared language and greater overlapping interests between scientists and statisticians. I liken the former to cooks and the latter to chefs. Cooks take the best of what chefs design and adopt and adapt it for other dishes. Chefs make great stuff, but sometimes it’s just too complicated to adapt or adopt. In my efforts at being a better chef, I’ve enrolled in the Quantitative Methods program at UT-Austin part-time while I work as an Associate Professor at Texas State University. Anyway…I love your blog and always learn a lot from it. Keep it up, please!
Brian K. Miller
Thanks Brian!
I think this’ll strike a chord with many researchers, me included. Toy examples with a handful of coin tosses, 50% CIs etc. are all fine to demonstrate technical points. They even made me pick up an introduction to Bayesian methods – which I had to lay aside a couple of months later because it took huge amounts of time and wasn’t directly applicable to what I was doing anyway.
I don’t need any convincing that Bayesian approaches (or whatever statistical tool of your choosing) are better than the methods my colleagues and I are familiar with. I need to be convinced that they’re superior enough *for the problems I need to solve* to justify investing time to learn them. These problems don’t involve 6 coin tosses and 50% CIs. (They don’t involve simple two sample comparisons either.)
A related point is that my use of some heuristic doesn’t mean that I religiously stick to whatever inference it points to. I can see why, within a strict statistical framework, it makes little sense to develop or advocate methods whose outcomes can be brushed aside based on the researcher’s intuition, experience, etc., and why it’s more coherent to model those expectations explicitly. But to subject-matter scientists it isn’t that big of a problem, I think.
This is a brilliant essay and brings into consciousness some issues I’ve had in the back of my mind for a long time, but never developed. The only thought I can add is that scientists can be faulted when they want statistics to tell them what to think, rather than to help them decide what to think. Big difference. In the first sense, we were taught to believe a finding if p < .05 but to toss it out if p = .06. Most of us always sort of suspected that any strict interpretation of probability theory here was probably incorrect, but we also thought it was “close enough” that we could make yes-no decisions using the rule. That nearly fatal mistake continues to be made every day, and might do us all in yet.
We non-statisticians need to lower our expectations of what statistics can do for us. They can point out interesting almost-facts (hey, did you know that under a null assumption you’d only get data as extreme as you got 6 times out of 100?). That’s not uninformative, but it shouldn’t be decisive. We need to really look at our data and interpret them in light of (among other considerations) prior findings, theoretical sensibility, general plausibility (another way to say: think Bayesian without doing all that math), and what they might mean in the context of other findings in our own lab or elsewhere. We never quite trusted ourselves, much less each other, to do this, so we turned to the .05 rule to make our decisions automatically. But the solution was illusory.