The usability of statistics; or, what happens when you think that (p=.05) != (p=.06)

The difference between significant and not significant is not itself significant.

That is the title of a 2006 paper by statisticians Andrew Gelman and Hal Stern. It is also the theme of a new review article in Nature Neuroscience by Sander Nieuwenhuis, Birte U Forstmann, and Eric-Jan Wagenmakers (via Gelman’s blog). The review examined several hundred papers in behavioral, systems, and cognitive neuroscience. Of all the papers that tried to compare two effects, about half of them made this error instead of properly testing for an interaction.
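To make the error concrete, here is a minimal sketch in Python. The numbers are made up for illustration, and it assumes two independent estimates with known standard errors so that a simple z-test applies. A real analysis would test the interaction within whatever model fits the design.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a z statistic under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2))

def z_test(estimate, se):
    """Return (z, p) for the null hypothesis that the true value is zero."""
    z = estimate / se
    return z, two_sided_p(z)

# Hypothetical, illustrative numbers (not from any real study).
est_a, se_a = 25.0, 10.0   # effect A
est_b, se_b = 10.0, 10.0   # effect B

z_a, p_a = z_test(est_a, se_a)
z_b, p_b = z_test(est_b, se_b)

# The comparison that actually answers "is A bigger than B?":
# test the difference, whose SE combines both SEs (independence assumed).
diff = est_a - est_b
se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
z_diff, p_diff = z_test(diff, se_diff)

print(f"Effect A:   z = {z_a:.2f}, p = {p_a:.3f}")
print(f"Effect B:   z = {z_b:.2f}, p = {p_b:.3f}")
print(f"Difference: z = {z_diff:.2f}, p = {p_diff:.3f}")
```

With these numbers, effect A clears the .05 threshold (p is about .01) and effect B does not (p is about .32), yet the test of the difference between them is nowhere near significant (p is about .29).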

I don’t know how often the error makes it through to published papers in social and personality psychology, but I see it pretty regularly as a reviewer. I call it out when I see it; sometimes other reviewers call it out too, sometimes they don’t.

I can also remember making this error as a grad student – and my advisor correcting me on it. But the funny thing is, it’s not something I was taught. I’m quite sure that nowhere along the way did any of my teachers say you can compare two effects by seeing if one is significant and the other isn’t. I just started doing it on my own. (And now I sometimes channel my old advisor and correct my own students on the same error, and I’m sure nobody’s teaching it to them either.)

If I wasn’t taught to make this error, where was I getting it from? When we talk about whether researchers have biases, usually we think of hot-button issues like political bias. But I think this reflects a more straightforward kind of bias: old habits of thinking that we carry with us into our professional work. To someone without scientific training, it seems like you should be able to ask “Does X cause Y, yes or no?” and expect a straightforward answer. Scientific training teaches us a couple of things. First, the question is too simple: it’s not a yes or no question; the answer is always going to come with some uncertainty; etc. Second, the logic behind the tool that most of us use – null hypothesis significance testing (NHST) – does not even approximate the form of the question. (Roughly: “In a world where X has zero effect on Y, would we see a result this far from zero less than 5% of the time?”)
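The parenthetical question above can be acted out in a small simulation. This is only a sketch under simplifying assumptions that do not come from any particular study: two groups drawn from the same normal distribution with a known standard deviation of 1 (so X truly has zero effect on Y), a z-test on the difference in means, and hypothetical sample sizes.

```python
import math
import random

def simulate_null_world(n_studies=4000, n_per_group=20, seed=1):
    """Fraction of studies reaching |z| > 1.96 when the true effect is zero."""
    rng = random.Random(seed)
    # Standard error of a difference in means when each group has sd = 1.
    se = math.sqrt(2 / n_per_group)
    false_alarms = 0
    for _ in range(n_studies):
        # Both groups come from the same distribution: X does nothing.
        treated = [rng.gauss(0, 1) for _ in range(n_per_group)]
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        diff = sum(treated) / n_per_group - sum(control) / n_per_group
        if abs(diff / se) > 1.96:
            false_alarms += 1
    return false_alarms / n_studies

rate = simulate_null_world()
print(f"Share of 'significant' results in a world with no effect: {rate:.3f}")
```

The share should land near 5%, which is all that the significance threshold promises. Note what the simulation does not tell you: whether X causes Y, or how strongly.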

So I think what happens is that when we are taught the abstract logic of what we are doing, it doesn’t really pervade our thinking until it’s been ground into us through repetition. For a period of time – maybe in some cases forever – we carry out the mechanics of what we have been taught to do (run an ANOVA) but we map it onto our old habits of thinking (“Does X cause Y, yes or no?”). And then we elaborate and extrapolate from them in ways that are entirely sensible by their own internal logic (“One ANOVA was significant and the other wasn’t, so X causes Y more than it causes Z, right?”).

One of the arguments you sometimes hear against NHST is that it doesn’t reflect the way researchers think. It’s a sort of usability argument: NHST is the butterfly ballot of statistical methods. In principle, I don’t think that argument carries the day on its own (if we need to use methods and models that don’t track our intuitions, we should). But it should be part of the discussion. And importantly, the Nieuwenhuis et al. review shows us how using unintuitive methods can have real consequences.

Nick Kristof gets a B- in social psych, and an incomplete in media studies

In today’s NYT, Nicholas Kristof writes about the implications of people choosing their own media sources. His argument: traditional newspapers present people with a wide spectrum of objective reporting. But when people choose their own news sources, they’ll gravitate toward voices that agree with their own ideology.

Along the way, Kristof sort of references research on confirmation bias and group polarization, though he doesn’t call them that, and weirdly he credits Harvard law professor Cass Sunstein with discovering group polarization.

But my main thought is this… Neither confirmation bias nor group polarization is a new phenomenon. Is it really true that people used to read and think about a broad spectrum of news and opinion? Or are we mis-remembering a supposedly golden era of objective reporting? Back when most big towns had multiple newspapers, you could pick the one that fit your ideology. You could subscribe to The Nation or National Review. You could buy books by Gore Vidal or William F. Buckley.

Plus, confirmation bias isn’t just about what information you choose to consume — it’s also about what you pay attention to, how you interpret it, and what you remember. Did everybody watch Murrow and Cronkite in the same way? Or did a liberal and a conservative watching the same newscast have a qualitatively different experience of it, by virtue of what they brought to the table?

No doubt things have changed a whole heck of a lot in the media, and they’re going to change a lot more. But I’m skeptical whenever I hear somebody argue that society is in decline because of some technological or cultural change. It’s a common narrative, but one that might be more poorly supported than we think.

Finally, a use for the heuristics and biases literature

How do you make a video game opponent realistically stupid?

A lot of attention in the artificial intelligence literature has gone into making computers as smart as possible. This has any number of pretty obvious applications: sorting through large datasets, improving decision-making, dishing out humility, destroying the human race.

But for game designers, a different problem has emerged, namely how to make a game opponent believably bad:

… People want to play against an opponent that is well matched to their skills, and so there are generally levels of AI in the game that the player can choose from. The simplest way to introduce stupidity into AI is to reduce the amount of computation that it’s allowed to perform. Chess AI generally performs billions of calculations when deciding what move to make. The more calculations that are made (and the more time taken), then (generally) the better the computer will play. If you reduce the amount of calculations performed, the computer will be a worse player. The problem with this approach is that it decreases the realism of the AI player. When you reduce the amount of computation, the AI will begin to make incredibly stupid mistakes — mistakes that are so stupid, no human would ever make them. The artificial nature of the game will then become apparent, which destroys the illusion of playing against a real opponent.

The approach being taken by game makers is to continue to make AI engines that are optimally rational — but then to introduce a probabilistic amount of realistic stupidity. For example, in poker, weak players are more likely to fold in the face of a large raise, even when the odds are in their favor. Game designers can incorporate this into creating “easy” opponents who are more likely (but not guaranteed) to fold when the human player raises.
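The poker example above might be sketched roughly like this. Everything here is hypothetical and simplified: the function names, the pot-odds rule standing in for an "optimally rational" engine, the 40% scare-fold probability, and the definition of a "large" raise as half the opponent's stack are all my own assumptions, not anything taken from a real game.

```python
import random

def rational_action(win_probability, pot, call_cost):
    """Stand-in for a rational engine: call only if calling has positive EV."""
    ev_call = win_probability * pot - (1 - win_probability) * call_cost
    return "call" if ev_call > 0 else "fold"

def easy_opponent_action(win_probability, pot, call_cost, stack,
                         scare_fold_prob=0.4, big_raise_fraction=0.5,
                         rng=random):
    """Start from the rational choice, then sometimes fold to a big raise."""
    action = rational_action(win_probability, pot, call_cost)
    facing_big_raise = call_cost >= big_raise_fraction * stack
    # Realistic weakness: more likely (but not guaranteed) to fold when
    # a large raise is on the table, even though the odds favor calling.
    if action == "call" and facing_big_raise and rng.random() < scare_fold_prob:
        return "fold"
    return action

# Usage: the odds favor calling, but the raise is large relative to the stack.
rng = random.Random(0)
trials = 10_000
folds = sum(
    easy_opponent_action(0.6, pot=300, call_cost=100, stack=150, rng=rng) == "fold"
    for _ in range(trials)
)
fold_rate = folds / trials
print(f"Easy opponent folded a favorable hand in {fold_rate:.1%} of trials")
```

The design choice is that the stupidity sits on top of the rational engine instead of replacing it, so the opponent never makes a mistake that a human would not make. Difficulty levels could then be a matter of tuning `scare_fold_prob`.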

So far, it appears that the game designers are using a pretty domain-specific approach — like modifying their poker AI based on the human errors that are common in poker. I wonder if additional traction could be gained from the broader psychology literature on heuristics. Heuristics are decision-making shortcuts that allow humans to make pretty good and highly efficient decisions across a wide range of important circumstances. But heuristics can also lead to biases that make us fall short of an optimal, rational expert, which is what most AI is programmed to be. Would game designers benefit from building their AI engines around prospect theory? Could you model the emotional states, and subsequently the appraisal tendencies, of computer opponents? Maybe someone is working on that already.
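As a very rough sketch of what a prospect-theory opponent could look like, here is a simplified valuation function. The parameter values are the commonly cited estimates from Tversky and Kahneman (1992). Using a single weighting parameter for both gains and losses, and applying the weights separately to each outcome instead of cumulatively, are simplifications of mine, and the function names are hypothetical.

```python
def subjective_value(outcome, alpha=0.88, beta=0.88, loss_aversion=2.25):
    """Value of a gain or loss relative to the reference point (zero)."""
    if outcome >= 0:
        return outcome ** alpha
    # Losses loom larger than equal-sized gains.
    return -loss_aversion * ((-outcome) ** beta)

def decision_weight(p, gamma=0.61):
    """Inverse-S weighting: small probabilities are overweighted, large ones underweighted."""
    return p ** gamma / ((p ** gamma + (1 - p) ** gamma) ** (1 / gamma))

def prospect_value(gamble):
    """gamble is a list of (probability, outcome) pairs."""
    return sum(decision_weight(p) * subjective_value(x) for p, x in gamble)

def biased_opponent_accepts(gamble):
    """A biased opponent takes the gamble only if it feels better than nothing."""
    return prospect_value(gamble) > 0

# A fair coin flip for 100 chips has an expected value of zero,
# but a loss-averse opponent turns it down.
coin_flip = [(0.5, 100), (0.5, -100)]
print(prospect_value(coin_flip), biased_opponent_accepts(coin_flip))
```

An opponent built this way would pass up fair bets and overreact to long shots, which is at least in the spirit of the human errors the game designers are trying to imitate.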